There exist numerous proofs of Bell's theorem, stating that quantum mechanics is incompatible with local realistic theories of 
nature. Here we define the strength of such nonlocality proofs in terms of the amount of evidence against local realism provided 
by the corresponding experiments. This measure tells us how many trials of the experiment we should perform in order to observe 
a substantial violation of local realism. Statistical considerations show that the amount of evidence should be measured by the 
Kullback-Leibler or relative entropy divergence between the probability distributions over the measurement outcomes that the 
respective theories predict. The statistical strength of a nonlocality proof is thus determined by the experimental implementation 
of it that maximizes the Kullback-Leibler divergence from experimental (quantum mechanical) truth to the set of all possible local 
theories. An implementation includes a specification with which probabilities the different measurement settings are sampled, and 
hence the maximization is done over all such setting distributions. 

We analyze two versions of Bell's nonlocality proof (his original proof and an optimized version by Peres), and proofs by 
Clauser-Horne-Shimony-Holt, Hardy, Mermin, and Greenberger-Horne-Zeilinger. We find that the GHZ proof is at least four and 
a half times stronger than all other proofs, while of the two-party proofs, the one of CHSH is the strongest. 

I. Introduction 

A plethora of proofs exist of Bell's theorem ("quantum mechanics violates local realism") encapsulated in inequalities 
and equalities, of which the most celebrated are those of Bell [5], Clauser, Horne, Shimony and Holt (CHSH) [9], 
Greenberger, Horne and Zeilinger (GHZ) [15], Hardy [20], and Mermin [25]. Competing claims exist that one proof is 
stronger than another. For instance, a proof in which only quantum predictions having probabilities 0 or 1 are involved is 
often said to be stronger than a proof that involves quantum predictions of probabilities between 0 and 1. Other researchers 
argue that one should compare the absolute differences between the probabilities that quantum mechanics predicts and those 
that are allowed by local theories. And so on. The aim of this paper is to settle such questions once and for all: we formally 
define the strength of a nonlocality proof and show that our definition is the only one compatible with generally accepted 
notions in information theory and theoretical statistics. 

To see the connection with statistics, note first that a mathematical nonlocality proof shows that the predicted probabilities 
of quantum theory are incompatible with local realism. Such a proof can be implemented as an experimental proof showing 
that physical reality conforms to those predictions and hence too is incompatible with local realism. We are interested in the 
strength of such experimental proofs, which should be measured in statistical terms: how sure do we become that a certain 
theory is false, after observing a certain violation from that theory, in a certain number of experiments. 

A. Our Game 

We analyze the statistics of nonlocality proofs in terms of a two-player game. The two players are the pro-quantum theory 
experimenter QM, and a pro-local realism theoretician LR. The experimenter QM is armed with a specific proof of Bell's 
theorem. A given proof — Bell, CHSH, Hardy, Mermin, GHZ — involves a collection of equalities and inequalities between 
various experimentally accessible probabilities. The proof specifies a given quantum state (of a collection of entangled qubits, 
for instance) and experimental settings (orientations of polarization filters or Stern-Gerlach devices). All local realistic theories 
of LR will obey the (in)equalities, while the observations that QM will make when performing the experiment (assuming that 
quantum mechanics is true) will violate these (in)equalities. The experimenter QM still has a choice of the probabilities with 
which the different combinations of settings will be applied, in a long sequence of independent trials. In other words, he must 
still decide how to allocate his resources over the different combinations of settings. At the same time, the local realist can 
come up with all kinds of different local realistic theories, predicting different probabilities for the outcomes given the settings. 
She might put forward different theories in response to different specific experiments. Thus the quantum experimenter will 
choose that probability distribution over his settings, for which the best local realistic model explains the data worst, when 
compared with the true (quantum mechanical) description. 

B. Quantifying Statistical Strength - Past Approaches 

How should we measure the statistical strength of a given experimental setup? In the past it was often simply said that 
the largest deviation in the Bell inequality is attained with such and such filter settings, and hence the experiment which is 
done with these settings gives (potentially) the strongest proof of nonlocality. The argument is however not very convincing. 
One should take account of the statistical variability in finite samples. The experiment that might confirm the largest absolute 
deviation from local realistic theories, might be subject to the largest standard errors, and therefore be less convincing than an 
experiment where a much smaller deviation can be proportionally much more accurately determined. 

Alternatively, the argument has just been that with a large enough sample size, even the smallest deviation between two 
theories can be made firm enough. For instance, Mermin [25] has said, in the context of a particular example: 

". . . to produce the conundrum it is necessary to run the experiment sufficiently many times to establish with 
overwhelming probability that the observed frequencies (which will be close to 25% and 75%) are not chance 
fluctuations away from expected frequencies of 33% and 66%. (A million runs is more than enough for this 
purpose). . . " 

We want to replace the words "sufficiently", "overwhelming", "more than enough" with something more scientific. (See 
Example 3 for our conclusion with respect to this.) And as experiments are carried out that are harder and harder to prepare, it 
becomes important to design them so that they give conclusive results with the smallest possible sample sizes. Initial work in 
this direction has been done by Peres [28]. Our approach is compatible with his, and extends it in a number of directions; see 
Section VI-C. 

C. Quantifying Statistical Strength - Our Approach 

We measure statistical strength using an information-theoretic quantification, namely the Kullback-Leibler (KL) divergence 
(also known as information deficiency or relative entropy [10]). We show (Appendix IV) that for large samples, all reasonable 
definitions of statistical strength that can be found in the statistical and information-theoretic literature essentially coincide 
with our measure. For a given type of experiment, we consider the game in which the experimenter wants to maximize the 
divergence while the local theorist looks for theories that minimize it. The experimenter's game space is the collection of 
probability distributions over joint settings, which we call in the sequel, for short, "setting distributions". (More properly, these 
are "joint setting distributions".) The local realist's game space is the space of local realistic theories. This game defines an 
experiment, such that each trial (assuming quantum mechanics is true) provides on average, the maximal support for quantum 
theory against the best explanation that local realism can provide, at that setting distribution. 

D. Our Results - Numerical 

We determined the statistical strength of five two-party proofs: Bell's original proof and Peres' optimized variant of it, and 
the proofs of CHSH, Hardy, and Mermin. Among these, CHSH turns out to be the strongest by far. We also determined 
the strength of the three-party GHZ proof. Contrary to what has sometimes been claimed (see Section VI), even the GHZ 
experiment has to be repeated a fair number of times before a substantial violation of local realism is likely to be observed. 
Nevertheless, it is about 4.5 times stronger than the CHSH experiment, meaning that, in order to observe the same support 
for QM and against LR, the CHSH experiment has to be run about 4.5 times as often as the GHZ experiment. 

E. Our Results - Mathematical 

To find the (joint) setting distribution that optimizes the strength of a nonlocality proof is a highly nontrivial computation. 
In the second part of this paper, we prove several mathematical properties of our notion of statistical strength. These provide 
insights into the relation between local realist and quantum distributions that are interesting in their own right. They also imply 
that determining statistical strength of a given nonlocality proof may be viewed as a convex optimization problem which can be 
solved numerically. We also provide a game-theoretic analysis involving minimax and maximin KL divergences. This analysis 
allows us to shortcut the computations in some important special cases. 

F. Organization of This Paper 

Section II gives a formal definition of what we mean by a nonlocality proof and the corresponding experiment, as well as 
the notation that we will use throughout the article. The kinds of nonlocality proofs that this article analyzes are described 
in Section III, using the CHSH proof as a specific example; the other proofs are described in more detail in Appendices II 
and III. The definition of the 'statistical strength of a nonlocality proof' is presented in Section IV, along with some standard 
facts about the Kullback-Leibler divergence and its role in hypothesis testing. With this definition, we are able to calculate 
the strengths of various nonlocality proofs. The results of these calculations for six well-known proofs are listed in Section V 
(with additional details again in Appendix III). The results are interpreted, discussed and compared in Section VI, which also 
contains four conjectures. Section VII constitutes the second, more mathematical part of the paper. It presents the mathematical 
results that allow us to compute statistical strength efficiently. 

We defer all issues that require knowledge of the mathematical aspects of quantum mechanics to the appendices. There we 
provide more detailed information about the nonlocality proofs we analyzed, the relation of Kullback-Leibler divergence to 
hypothesis testing, and the proofs of the theorems we present in the main text. 

Depending on their interests, readers might want to skip certain sections of this admittedly long paper. Only the first six 
sections are crucial; all other parts provide background information of one sort or another. 

II. Formal Setup 

A basic nonlocality proof ("quantum mechanics violates local realism") has the following ingredients. There are two parties 
$\mathcal{A}$ and $\mathcal{B}$, who each have one of two entangled qubits at their disposal. They may each choose out of two different measurement 
settings. In each trial of the experiment, $\mathcal{A}$ and $\mathcal{B}$ randomly sample from the four different joint settings and each observe 
one of two different binary outcomes, say "F" (false) and "T" (true). Quantum mechanics enables us to compute the joint 
probability distribution of the outcomes, as a function of the measurement settings and of the joint state of the two qubits. 
Thus possible design choices are: the state of the qubits, the values of the settings, and the probability distribution over the 
settings. More complicated experiments may involve more parties, more settings, and more outcomes. Such a generalized 
setting is formalized in Appendix I. In the main text, we focus on the basic 2 x 2 x 2 case (which stands for '2 parties x 
2 measurement settings per party x 2 outcomes per measurement setting'). Below we introduce notation for all ingredients 
involved in nonlocality proofs. 

A. Distribution of Settings 

The random variable $A$ denotes the measurement setting of party $\mathcal{A}$ and the random variable $B$ denotes the measurement 
setting of party $\mathcal{B}$. Both $A$ and $B$ take values in $\{1,2\}$. The experimenter QM will decide on the distribution $\sigma$ of $(A,B)$, 
giving the probabilities (and, after many trials of the experiment, the frequencies) with which each (joint) measurement setting 
is sampled. This setting distribution $\sigma$ is identified with its probability vector $\sigma := (\sigma_{11}, \sigma_{12}, \sigma_{21}, \sigma_{22}) \in \Sigma$, where $\Sigma$ is the unit 
simplex in $\mathbb{R}^4$ defined by 

$$\Sigma := \Big\{ (\sigma_{11}, \sigma_{12}, \sigma_{21}, \sigma_{22}) \;\Big|\; \sum_{a,b \in \{1,2\}} \sigma_{ab} = 1, \ \text{for all } a,b:\ \sigma_{ab} \ge 0 \Big\}. \tag{1}$$

We use $\Sigma^{UC}$ to denote the set of vectors representing uncorrelated distributions in $\Sigma$. Formally, $\sigma \in \Sigma^{UC}$ if and only if 
$\sigma_{ab} = (\sigma_{a1} + \sigma_{a2})(\sigma_{1b} + \sigma_{2b})$ for all $a,b \in \{1,2\}$. 
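The membership conditions for $\Sigma$ and $\Sigma^{UC}$ can be checked mechanically. The following is a minimal sketch, assuming a setting distribution is stored as a Python dictionary keyed by the joint setting `(a, b)`; the function names and the three example distributions are illustrative choices, not part of the paper.

```python
from itertools import product

SETTINGS = list(product((1, 2), repeat=2))

def in_simplex(sigma, tol=1e-12):
    """True if sigma is a probability vector over the four joint settings."""
    nonnegative = all(sigma[s] >= 0 for s in SETTINGS)
    return nonnegative and abs(sum(sigma[s] for s in SETTINGS) - 1.0) <= tol

def is_uncorrelated(sigma, tol=1e-12):
    """True if sigma_ab equals the product of its marginals for every (a, b)."""
    for a, b in SETTINGS:
        marginal_a = sigma[(a, 1)] + sigma[(a, 2)]  # Pr(A = a)
        marginal_b = sigma[(1, b)] + sigma[(2, b)]  # Pr(B = b)
        if abs(sigma[(a, b)] - marginal_a * marginal_b) > tol:
            return False
    return True

# Illustrative setting distributions (assumed values).
uniform = {s: 0.25 for s in SETTINGS}
skewed_product = {(a, b): {1: 0.3, 2: 0.7}[a] * {1: 0.6, 2: 0.4}[b]
                  for a, b in SETTINGS}
correlated = {(1, 1): 0.5, (1, 2): 0.0, (2, 1): 0.0, (2, 2): 0.5}
```

Here `uniform` and `skewed_product` lie in $\Sigma^{UC}$, while `correlated` lies in $\Sigma$ but not in $\Sigma^{UC}$, because both parties always use the same setting index.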

B. Measurement Outcomes 

The random variable $X$ denotes the measurement outcome of party $\mathcal{A}$ and the random variable $Y$ denotes that of party $\mathcal{B}$. 
Both $X$ and $Y$ take values in $\{T,F\}$, with F standing for 'false' and T standing for 'true'. Thus, the statement '$X = F, Y = T$' 
describes the event that party $\mathcal{A}$ observed F and party $\mathcal{B}$ observed T. 

The distribution of $(X,Y)$ depends on the chosen setting $(a,b) \in \{1,2\}^2$. The state of the entangled qubits together with the 
measurement settings determines four conditional distributions $Q_{11}, Q_{12}, Q_{21}, Q_{22}$ for $(X,Y)$, one for each joint measurement 
setting, where $Q_{ab}$ is the distribution of $(X,Y)$ given that measurement setting $(a,b)$ has been chosen. For example, $Q_{ab}(X = F, Y = T)$, 
abbreviated to $Q_{ab}(F,T)$, denotes the probability that party $\mathcal{A}$ observes F and party $\mathcal{B}$ observes T, given that the 
device of $\mathcal{A}$ is in setting $a$ and the device of $\mathcal{B}$ is in setting $b$. According to QM, the total outcome $(X,Y,A,B)$ of a single 
trial is then distributed as $Q_\sigma$, defined by $Q_\sigma(X = x, Y = y, A = a, B = b) := \sigma_{ab} Q_{ab}(X = x, Y = y)$. 
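As a small illustration of how $Q_\sigma$ is assembled from $\sigma$ and the four conditionals $Q_{ab}$, the sketch below builds the joint distribution as a dictionary. The particular values of `sigma` and `q` are hypothetical placeholders chosen only to exercise the formula; they do not correspond to any quantum state discussed in the paper.

```python
from itertools import product

SETTINGS = list(product((1, 2), repeat=2))
OUTCOMES = list(product("TF", repeat=2))

def joint_distribution(sigma, q):
    """Q_sigma(x, y, a, b) = sigma_ab * Q_ab(x, y)."""
    return {(x, y, a, b): sigma[(a, b)] * q[(a, b)][(x, y)]
            for (a, b) in SETTINGS for (x, y) in OUTCOMES}

# Hypothetical inputs, for illustration only.
sigma = {(1, 1): 0.4, (1, 2): 0.1, (2, 1): 0.3, (2, 2): 0.2}
q = {s: {o: 0.25 for o in OUTCOMES} for s in SETTINGS}
q[(1, 1)] = {("T", "T"): 0.5, ("T", "F"): 0.0,
             ("F", "T"): 0.0, ("F", "F"): 0.5}

q_sigma = joint_distribution(sigma, q)
```

Because each $Q_{ab}$ sums to one and $\sigma$ sums to one, the sixteen entries of `q_sigma` again sum to one.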

C. Definition of a Nonlocality Proof and Corresponding Nonlocality Experiments 

A nonlocality proof for 2 parties, 2 measurement settings per party, and 2 outcomes per measurement is identified with an 
entangled quantum state of two qubits (realized by, e.g., two photons) and two measurement devices (e.g., polarization filters), 
each of which can be used in one of two different measurement settings (polarization angles). Everything about the quantum state, 
the measurement devices, and their settings that is relevant for the probability distribution of outcomes of the corresponding 
experiment can be summarized by the four distributions $Q_{ab}$ of $(X,Y)$, one for each (joint) setting $(a,b) \in \{1,2\}^2$. Henceforth, 
we will simply identify a 2 x 2 x 2 nonlocality proof with the vector of distributions $Q := (Q_{11}, Q_{12}, Q_{21}, Q_{22})$. 

This definition extends in a straightforward manner to a situation with more than two parties, two 
settings per party, or two outcomes per setting. In Appendix I we provide a formal definition of the general case where the 
numbers of parties, settings, and outcomes are arbitrary. 

We call a nonlocality proof $Q = (Q_{11}, Q_{12}, Q_{21}, Q_{22})$ proper if and only if it violates local realism, i.e. if there exists no 
local realist distribution $\pi$ (as defined below) such that $P_{ab;\pi}(\cdot) = Q_{ab}(\cdot)$ for all $(a,b) \in \{1,2\}^2$. 

For the corresponding 2 x 2 x 2 nonlocality experiment we have to specify the setting distribution $\sigma$ with which the 
experimenter QM samples the different settings $(a,b)$. Thus, for a single nonlocality proof $Q$, QM can use different experiments 
(different in $\sigma$) to verify Nature's nonlocality. Each experiment consists of a series of trials, where, per trial, the event 
$(X,Y,A,B) = (x,y,a,b)$ occurs with probability $Q_\sigma(X = x, Y = y, A = a, B = b) = \sigma_{ab} Q_{ab}(X = x, Y = y)$. 

D. Local Realist Theories 

The local realist (LR) may provide any 'local' theory she likes to explain the results of the experiments. 

Under such a theory it is possible to talk about "the outcome that $\mathcal{A}$ would have observed, if she had used setting 1", 
independently of which setting was used by $\mathcal{B}$ and indeed of whether $\mathcal{A}$ actually used setting 1 or 2. Thus we 
have four binary random variables, which we will call $X_1$, $X_2$, $Y_1$ and $Y_2$. As before, variables named $X$ correspond to $\mathcal{A}$'s 
observations, and variables named $Y$ correspond to $\mathcal{B}$'s observations. According to LR, each experiment determines values for 
the four random variables $(X_1, X_2, Y_1, Y_2)$. For $a \in \{1,2\}$, $X_a \in \{T,F\}$ denotes the outcome that party $\mathcal{A}$ would have observed 
if the measurement setting of $\mathcal{A}$ had been $a$. Similarly, for $b \in \{1,2\}$, $Y_b \in \{T,F\}$ denotes the outcome that party $\mathcal{B}$ would 
have observed if the measurement setting of $\mathcal{B}$ had been $b$. 

A local theory $\pi$ may be viewed as a probability distribution for $(X_1, X_2, Y_1, Y_2)$. Formally, we define $\pi$ as a 16-dimensional 
probability vector with indices $(x_1, x_2, y_1, y_2) \in \{T,F\}^4$. By definition, $P_\pi(X_1 = x_1, X_2 = x_2, Y_1 = y_1, Y_2 = y_2) := \pi_{x_1 x_2 y_1 y_2}$. For 
example, $\pi_{FFFF}$ denotes LR's probability that, in all possible measurement settings, $\mathcal{A}$ and $\mathcal{B}$ would both have observed F. 
The set of local theories can thus be identified with the unit simplex in $\mathbb{R}^{16}$, which we will denote by $\Pi$. 

Recall that the quantum state of the entangled qubits determines four distributions over measurement outcomes $Q_{ab}(X = \cdot, Y = \cdot)$, 
one for each joint setting $(a,b) \in \{1,2\}^2$. Similarly, each LR theory $\pi \in \Pi$ determines four distributions $P_{ab;\pi}(X = \cdot, Y = \cdot)$. 
These are the distributions, according to the local realist theory $\pi$, of the random variables $(X,Y)$ given that setting $(a,b)$ has 
been chosen. Thus, the value $P_{ab;\pi}(X = x, Y = y)$ is defined as the sum of four terms: 

$$P_{ab;\pi}(X = x, Y = y) := \sum_{\substack{x_1, x_2, y_1, y_2 \in \{T,F\} \\ x_a = x,\ y_b = y}} \pi_{x_1 x_2 y_1 y_2}. \tag{2}$$

We suppose that LR does not dispute the actual setting distribution $\sigma$ which is used in the experiment; she only disputes the 
probability distributions of the measurement outcomes given the settings. According to LR, therefore, the outcome of a single 
trial is distributed as $P_{\sigma;\pi}$, defined by $P_{\sigma;\pi}(X = x, Y = y, A = a, B = b) := \sigma_{ab} P_{ab;\pi}(X = x, Y = y)$. 
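The marginalization in (2) is easy to state in code. The sketch below assumes a local theory $\pi$ is stored as a dictionary keyed by the tuple `(x1, x2, y1, y2)`; the helper name `conditional` and the two example theories are illustrative assumptions.

```python
from itertools import product

OUTCOMES = list(product("TF", repeat=2))

def conditional(pi, a, b):
    """P_{ab;pi}(x, y): sum of pi over all (x1, x2, y1, y2) with x_a = x, y_b = y."""
    p = {o: 0.0 for o in OUTCOMES}
    for (x1, x2, y1, y2), weight in pi.items():
        x = (x1, x2)[a - 1]  # outcome party A would see in setting a
        y = (y1, y2)[b - 1]  # outcome party B would see in setting b
        p[(x, y)] += weight
    return p

# Two illustrative local theories (assumed, not taken from the paper):
# uniform weight on all 16 assignments, and a deterministic "always T" theory.
pi_uniform = {k: 1 / 16 for k in product("TF", repeat=4)}
pi_always_true = {k: 0.0 for k in product("TF", repeat=4)}
pi_always_true[("T", "T", "T", "T")] = 1.0
```

For each fixed $(a,b)$ and $(x,y)$, exactly four of the sixteen index tuples satisfy $x_a = x$ and $y_b = y$, which is why (2) is a sum of four terms.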

III. The Nonlocality Proofs 

In this section we briefly describe the five (or six, since we have two versions of Bell's proof) celebrated nonlocality proofs 
for which we will compute the statistical strength. In Appendix III we provide further details about the entangled quantum 
states that give rise to the violations of the various inequalities. 

Let us interpret the measurement outcomes F and T in terms of Boolean logic, i.e. F is "false" and T is "true". We can 
then use Boolean expressions such as $X_2 \,\&\, Y_2$, which evaluates to true whenever both $X_2$ and $Y_2$ evaluate to 'true', i.e. when 
both $X_2 = T$ and $Y_2 = T$. We derive the proofs by applying the rule that if the event $X = T$ implies the event $Y = T$ (in short 
"$X \Rightarrow Y$"), then $\Pr(X) \le \Pr(Y)$. In similar vein, we will use rules like $\Pr(X \vee Y) \le \Pr(X) + \Pr(Y)$ and 
$1 - \Pr(\neg X) - \Pr(\neg Y) \le 1 - \Pr(\neg X \vee \neg Y) = \Pr(X \,\&\, Y)$. 

As an aside we want to mention that the proofs of Bell, CHSH and Hardy all contain the following argument, which can be 
traced back to the nineteenth century logician George Boole (1815-1864) [8]. Consider four events such that 
$\neg B \cap \neg C \cap \neg D \Rightarrow \neg A$. Then it follows that $A \Rightarrow B \cup C \cup D$. And from this, it follows that $\Pr(A) \le \Pr(B) + \Pr(C) + \Pr(D)$. In the CHSH argument 
and the Bell argument, the events concern the equality or inequality of one of the $X_i$ with one of the $Y_j$. In the Hardy argument, 
the events concern the joint equality or inequality of one of the $X_i$, one of the $Y_j$, and a specific value F or T. 

Example 1 (The CHSH Argument): For the CHSH argument one notes that the implication 

$$[(X_1 = Y_1) \,\&\, (X_1 = Y_2) \,\&\, (X_2 = Y_1)] \Rightarrow (X_2 = Y_2) \tag{3}$$

is logically true, and hence $(X_2 \ne Y_2) \Rightarrow [(X_1 \ne Y_1) \vee (X_1 \ne Y_2) \vee (X_2 \ne Y_1)]$ holds. As a result, local realism implies the 
following "CHSH inequality" 

$$\Pr(X_2 \ne Y_2) \le \Pr(X_1 \ne Y_1) + \Pr(X_1 \ne Y_2) + \Pr(X_2 \ne Y_1), \tag{4}$$

which can be violated by many choices of settings and states under quantum theory. As a specific example, CHSH identified 
quantum states and settings such that the first probability equals (approximately) 0.85 while the three probabilities on the right 
are each (approximately) 0.15, thus clearly violating (4). In full detail, the probability distribution that corresponds to CHSH's 
proof is as follows: 



$$\begin{array}{l|cc|cc}
\Pr & X_1 = T & X_1 = F & X_2 = T & X_2 = F \\ \hline
Y_1 = T & 0.4267766953 & 0.0732233047 & 0.4267766953 & 0.0732233047 \\
Y_1 = F & 0.0732233047 & 0.4267766953 & 0.0732233047 & 0.4267766953 \\ \hline
Y_2 = T & 0.4267766953 & 0.0732233047 & 0.0732233047 & 0.4267766953 \\
Y_2 = F & 0.0732233047 & 0.4267766953 & 0.4267766953 & 0.0732233047
\end{array} \tag{5}$$



In Appendix III-C we explain how to arrive at this table. The table lists the 4 conditional distributions $Q = (Q_{11}, Q_{12}, Q_{21}, Q_{22})$ 
defined in Section II-C, and thus uniquely determines the nonlocality proof $Q$. As an example of how to read the table, note 
that $\Pr(X_2 \ne Y_2)$ is given by 

$$\Pr(X_2 \ne Y_2) = \Pr(X_2 = T \,\&\, Y_2 = F) + \Pr(X_2 = F \,\&\, Y_2 = T) = 0.4267766953 + 0.4267766953 \approx 0.85,$$

showing that the expression on the left in (4) is approximately 0.85. That on the right evaluates to approximately 0.44 
(each of the three terms is $2 \times 0.0732233047 \approx 0.146$). 

The other nonlocality proofs are derived in a similar manner: one shows that according to any and all local realist theories, the 
random variables $X_1, X_2, Y_1, Y_2$ must satisfy certain logical constraints and hence probabilistic (in)equalities. One then shows 
that these constraints or (in)equalities can be violated by certain quantum mechanical states and settings, giving rise to a table 
of probabilities of observations similar to (5). Details are given in Appendix II. 
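Both halves of the CHSH argument can be checked by brute force. The sketch below first verifies that implication (3) holds for all 16 assignments of $(X_1, X_2, Y_1, Y_2)$, and then evaluates the two sides of inequality (4) using the probabilities listed in the CHSH table. The variable names are illustrative.

```python
from itertools import product

def implication_3_holds(x1, x2, y1, y2):
    """(X1 = Y1) & (X1 = Y2) & (X2 = Y1) implies (X2 = Y2)."""
    premise = (x1 == y1) and (x1 == y2) and (x2 == y1)
    return (not premise) or (x2 == y2)

# Exhaustive check over all deterministic local assignments.
implication_always_true = all(
    implication_3_holds(*values) for values in product("TF", repeat=4))

# Entries of the CHSH table: HI is the larger entry, LO the smaller one.
HI, LO = 0.4267766953, 0.0732233047

# Pr(X_a != Y_b) under quantum mechanics, read off per joint setting (a, b).
pr_differ = {(1, 1): 2 * LO, (1, 2): 2 * LO, (2, 1): 2 * LO, (2, 2): 2 * HI}

lhs = pr_differ[(2, 2)]
rhs = pr_differ[(1, 1)] + pr_differ[(1, 2)] + pr_differ[(2, 1)]
violated = lhs > rhs
```

With these numbers `lhs` is about 0.854 and `rhs` about 0.439, so the quantum predictions violate an inequality that every local realist theory must satisfy.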

IV. Kullback-Leibler Divergence and Statistical Strength 

A. Kullback-Leibler Divergence 

In this section we formally define our notion of 'statistical strength of a nonlocality proof'. The notion will be based on the 
KL divergence, an information theoretic quantity which we now introduce. Let $Z$ be an arbitrary finite set. For a distribution $Q$ 
over $Z$, $Q(z)$ denotes the probability of event $\{z\}$. For two (arbitrary) distributions $Q$ and $P$ defined over $Z$, the Kullback-Leibler 
(KL) divergence from $Q$ to $P$ is defined as 

$$D(Q\|P) := \sum_{z \in Z} Q(z) \log \frac{Q(z)}{P(z)}, \tag{6}$$

where the logarithm is taken here, as in the rest of the paper, to base 2. We use the conventions that, for $y > 0$, $y \log (y/0) := \infty$, 
and $0 \log 0 := \lim_{y \downarrow 0} y \log y = 0$. 

The KL divergence is also known as relative entropy, cross-entropy, information deficiency or I-divergence. Introduced in 
[22], KL divergence has become a central notion in information theory, statistics and large deviation theory. A good reference 
is [10]. It is straightforward to show (using concavity of the logarithm and Jensen's inequality) that $D(Q\|P) \ge 0$ with equality 
if and only if $P = Q$; in this sense, KL divergence behaves like a distance. However, in general $D(P\|Q) \ne D(Q\|P)$, so formally 
$D(\cdot\|\cdot)$ is not a distance. (See the examples in Appendix IV-C for a clarification of this asymmetry.) 

KL divergence expresses the average disbelief in $P$, when observing random outcomes $Z$ from $Q$. Thus occasionally (with 
respect to $Q$) one observes an outcome $Z$ that is more likely under $P$ than under $Q$, but on average (with respect to $Q$), the outcomes 
are more likely under $Q$ than under $P$, expressed by the fact that $D(Q\|P) \ge 0$. In Appendices IV-B and IV-C we provide several 
properties and examples of the KL divergence. 
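To make the definition and its asymmetry concrete, here is a minimal sketch of the base-2 KL divergence for distributions over a finite set, stored as dictionaries. The two coin distributions `q` and `p` are hypothetical examples, not taken from the paper.

```python
import math

def kl_divergence(q, p):
    """D(Q || P) in bits for distributions given as dicts over a finite set."""
    total = 0.0
    for z, qz in q.items():
        if qz == 0.0:
            continue                 # convention: 0 log 0 := 0
        pz = p.get(z, 0.0)
        if pz == 0.0:
            return math.inf          # Q puts mass where P puts none
        total += qz * math.log2(qz / pz)
    return total

# Hypothetical example: a fair coin versus a heavily biased coin.
q = {"heads": 0.5, "tails": 0.5}
p = {"heads": 0.9, "tails": 0.1}

d_qp = kl_divergence(q, p)  # data from the fair coin, tested against the biased one
d_pq = kl_divergence(p, q)  # the reverse direction
```

In this example `d_qp` is roughly 0.737 bits and `d_pq` roughly 0.531 bits, so the two directions differ, while `kl_divergence(q, q)` is exactly zero.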

KL divergence has several different interpretations and applications. Below we focus on the interpretation we are concerned 
with in this paper: KL divergence as a measure of 'statistical closeness' in the context of statistical hypothesis testing. 

1) KL Divergence and Statistical Strength in Simple Hypothesis Testing: Let $Z_1, Z_2, \ldots$ be a sequence of random variables 
independently generated either by some distribution $P$ or by some distribution $Q$ with $Q \ne P$. Suppose we are given a sample 
(sequence of outcomes) $z_1, \ldots, z_n$. We want to perform a statistical test in order to find out whether the sample is from $P$ or 
$Q$. Suppose that the sample is, in fact, generated by $Q$ ('$Q$ is true'). Then, given enough data, the data will with very high 
($Q$-) probability be overwhelmingly more likely according to $Q$ than according to $P$. That is, the data strongly suggest that 
they were sampled from $Q$ rather than $P$. The 'statistical distance' between $P$ and $Q$ indicates how strongly or, equivalently, 
how convincingly data that are generated by $Q$ will suggest that they are from $Q$ rather than $P$. It turns out that this notion of 
'statistical distance' between two distributions is precisely captured by the KL divergence $D(Q\|P)$, which can be interpreted 
as the average amount of support in favor of $Q$ and against $P$ per trial. The larger the KL divergence, the larger the amount 
of support per trial. It turns out that: 

1) For a fixed sample size $n$, the larger $D(Q\|P)$, the more support there will be in the sample $z_1, \ldots, z_n$ for $Q$ versus $P$ 
(with high probability under $Q$). 

2) For a pre-determined fixed level of support in favor of $Q$ against $P$ (equivalently, level of 'confidence' in $Q$, level of 
'convincingness' of $Q$), we have that the larger $D(Q\|P)$, the smaller the sample size before this level of support is 
achieved (with high probability under $Q$). 

3) If, based on observed data $z_1, \ldots, z_n$, an experimenter decides that $Q$ rather than $P$ must have generated the data, then, 
the larger $D(Q\|P)$, the larger the confidence the experimenter should have in this decision (with high probability under 
$Q$). 

What exactly do we mean by 'level of support/convincingness'? Different approaches to statistical inference define this notion 
in a different manner. Nevertheless, for large samples, all definitions of support one finds in the literature become equivalent, 
and are determined by the KL divergence up to lower order terms in the exponent. 

For example, in the Bayesian approach to statistical inference, the statistician assigns initial, so-called prior probabilities 
to the hypotheses '$Q$ generated the data' and '$P$ generated the data', reflecting the fact that he does not know which of the 
two distributions in fact did generate the data. For example, he may assign probability 1/2 to each hypothesis. Then given 
data $Z_1, Z_2, \ldots, Z_n$, he can compute the posterior probabilities of the two hypotheses, conditioned on this data. It turns out 
that, if $Q$ actually generated the data, and the prior probabilities on both $P$ and $Q$ are nonzero, then the posterior odds that $P$ 
rather than $Q$ generated the data typically behave as $2^{-nD(Q\|P) + o(n)}$. Thus, the larger the KL divergence $D(Q\|P)$, the larger 
the odds in favour of $Q$ and therefore, the larger the confidence in the decision '$Q$ generated the data'. Different approaches to 
hypothesis testing provide somewhat different measures of 'level of support' such as p-values and code length difference. But, 
as we show in Appendix IV, these different measures of support can also be related to the KL divergence. In the appendix 
we also give a more intuitive and informal explanation of how KL divergence is related to these notions, and we explain why, 
contrary to what has sometimes been implicitly assumed, absolute deviations between probabilities can be quite bad indicators 
of statistical distance. 
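The leading-order behaviour $2^{-nD(Q\|P)}$ translates directly into a rough sample-size estimate. The following is a back-of-the-envelope sketch that deliberately ignores the $o(n)$ term in the exponent, so it should be read as an order-of-magnitude guide rather than an exact result; the function names and the example value of 0.5 bits per trial are assumptions made for illustration.

```python
import math

def log2_posterior_odds(d, n, log2_prior_odds=0.0):
    """Leading-order log2 odds of P versus Q after n trials generated by Q.

    d is D(Q || P) in bits per trial; the o(n) correction is ignored.
    """
    return log2_prior_odds - n * d

def trials_for_odds(d, odds_against_p):
    """Smallest n with n * d >= log2(odds_against_p), assuming equal priors."""
    return math.ceil(math.log2(odds_against_p) / d)

# Hypothetical example: 0.5 bits of support per trial, target odds of 2**20 to 1.
n_needed = trials_for_odds(0.5, 2 ** 20)
```

Halving the divergence doubles the number of trials needed for the same target odds, which is the sense in which a larger KL divergence gives a "stronger" experiment.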

2) KL Divergence and Statistical Strength in Composite Hypothesis Testing: Observing a sample generated by $Q$ or $P$ and 
trying to infer whether it was generated by $Q$ or $P$ is called hypothesis testing in the statistical literature. A hypothesis is simple 
if it consists of a single probability distribution. A hypothesis is called composite if it consists of a set of distributions. The 
composite hypothesis '$\mathcal{P}$' should be interpreted as 'there exists a $P \in \mathcal{P}$ that generated the data'. Above, we related the KL 
divergence to statistical strength when testing two simple hypotheses against each other. Yet in most practical applications (and 
in this paper) the aim is to test two hypotheses, at least one of which is composite. For concreteness, suppose we want to test 
the distribution $Q$ against the set of distributions $\mathcal{P}$. In this case, under some regularity conditions on $\mathcal{P}$ and $Z$, the element 
$P \in \mathcal{P}$ that is closest in statistical divergence to $Q$ determines the statistical strength of the test of $Q$ against $\mathcal{P}$. Formally, for 
a set of distributions $\mathcal{P}$ on $Z$ we define (as is customary, [10]) 

$$D(Q\|\mathcal{P}) := \inf_{P \in \mathcal{P}} D(Q\|P). \tag{7}$$

Analogously to $D(Q\|P)$, $D(Q\|\mathcal{P})$ may be interpreted as the average amount of support in favor of $Q$ and against $\mathcal{P}$ per trial, 
if data are generated according to $Q$. 

In our case, QM claims that data are generated by some distribution $Q_\sigma$. LR claims that data are generated by some $P \in \mathcal{P}_\sigma$, 
where $\mathcal{P}_\sigma := \{P_{\sigma;\pi} : \pi \in \Pi\}$. Here $Q_\sigma$ corresponds to a nonlocality proof equipped with setting distribution $\sigma$, and $\mathcal{P}_\sigma$ is 
the set of probability distributions of all possible local theories with the same $\sigma$ (see Section II). QM and LR agree to 
test the hypothesis $Q_\sigma$ against $\mathcal{P}_\sigma$. QM, who knows that data are really generated according to $Q_\sigma$, wants to select $\sigma$ in 
such a way that the average amount of support in favor of $Q_\sigma$ and against $\mathcal{P}_\sigma$ is maximized. We shall argue in Section VI that 
QM should restrict himself to uncorrelated settings. The previous discussion then suggests that he should pick the $\sigma \in \Sigma^{UC}$ 
that maximizes statistical strength $D(Q_\sigma\|\mathcal{P}_\sigma)$. In Appendix IV we show that this is (in some sense) also the optimal choice 
according to statistical theory. Thus we define the statistical strength of $Q$ as $\sup_{\sigma \in \Sigma^{UC}} D(Q_\sigma\|\mathcal{P}_\sigma)$, but we also present results 
for two alternative classes of setting distributions. 

B. Formal Definition of Statistical Strength 

We define 'the statistical strength of nonlocality proof Q' in three different manners, depending on the freedom that we 
allow QM in determining the sampling probabilities of the different measurement settings. 

Definition 1 (Strength for Uniform Settings): When each measurement setting is sampled with equal probability, the resulting 
strength $S_Q^{UNI}$ is defined by 

$$S_Q^{UNI} := D(Q_{\sigma^\circ}\|\mathcal{P}_{\sigma^\circ}) = \inf_{\pi \in \Pi} D(Q_{\sigma^\circ}\|P_{\sigma^\circ;\pi}), \tag{8}$$

where $\sigma^\circ$ denotes the uniform distribution over the settings. 

Definition 2 (Strength for Uncorrelated Settings): When the experimenter QM is allowed to choose any distribution on 
measurement settings, as long as the distribution for each party is uncorrelated with the distributions of the other parties, the 
resulting strength $S_Q^{UC}$ is defined by 

$$S_Q^{UC} := \sup_{\sigma \in \Sigma^{UC}} D(Q_\sigma\|\mathcal{P}_\sigma) = \sup_{\sigma \in \Sigma^{UC}} \inf_{\pi \in \Pi} D(Q_\sigma\|P_{\sigma;\pi}), \tag{9}$$

where $\sigma \in \Sigma^{UC}$ denotes the use of uncorrelated settings. 

Definition 3 (Strength for Correlated Settings): When the experimenter QM is allowed to choose any distribution on 
measurement settings (including correlated distributions), the resulting strength $S_Q^{COR}$ is defined by 

$$S_Q^{COR} := \sup_{\sigma \in \Sigma} D(Q_\sigma\|\mathcal{P}_\sigma) = \sup_{\sigma \in \Sigma} \inf_{\pi \in \Pi} D(Q_\sigma\|P_{\sigma;\pi}), \tag{10}$$

where $\sigma \in \Sigma$ denotes the use of correlated settings. 

Throughout the remainder of the paper, we sometimes abbreviate the subscript $\sigma \in \Sigma^{UC}$ to $\Sigma^{UC}$, and $\pi \in \Pi$ to $\Pi$. 

In Section VII-A we list some essential topological and analytical properties of our three notions of strength. For now, we 
only need the following reassuring fact (Theorem 1, Section VII-A, part 2(c)): 

Fact 1: $S_Q^{UNI} \le S_Q^{UC} \le S_Q^{COR}$. Moreover, $S_Q^{UNI} > 0$ if and only if $Q$ is a proper nonlocality proof. 

As we explain in Section VI, we regard the definition $S_Q^{UC}$, allowing maximization over uncorrelated distributions, as the 
'right' one. Henceforth, whenever we speak of 'statistical strength' without further qualification, we refer to $S_Q^{UC}$. Nevertheless, 
to facilitate comparisons, we list our results also for the two alternative definitions of statistical strength. 

V. The Results 

The following table summarizes the statistical strengths of the various nonlocality proofs. Note that the numbers in the middle
column correspond to the 'right' definition $S_Q^{\mathrm{UC}}$, which considers uncorrelated distributions for the measurement settings.



Strength          Uniform Settings        Uncorrelated Settings   Correlated Settings
                  $S_Q^{\mathrm{UNI}}$    $S_Q^{\mathrm{UC}}$     $S_Q^{\mathrm{COR}}$

Original Bell     0.0141597409            0.0158003672            0.0169800305
Optimized Bell    0.0177632822            0.0191506613            0.0211293952
CHSH              0.0462738469            0.0462738469            0.0462738469
Hardy             0.0278585182            0.0279816333            0.0280347655
Mermin            0.0157895843            0.0191506613            0.0211293952
GHZ               0.2075187496            0.2075187496            0.4150374993

                                                                                      (11)
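As a quick consistency check, the entries of Table (11) can be held in a small data structure and tested against Fact 1. The following Python sketch only restates the numbers of the table; the names `STRENGTHS`, `satisfies_fact_1` and `gain_from_optimizing` are illustrative and not part of the paper.

```python
# Statistical strengths from Table (11): (uniform, uncorrelated, correlated).
STRENGTHS = {
    "Original Bell": (0.0141597409, 0.0158003672, 0.0169800305),
    "Optimized Bell": (0.0177632822, 0.0191506613, 0.0211293952),
    "CHSH": (0.0462738469, 0.0462738469, 0.0462738469),
    "Hardy": (0.0278585182, 0.0279816333, 0.0280347655),
    "Mermin": (0.0157895843, 0.0191506613, 0.0211293952),
    "GHZ": (0.2075187496, 0.2075187496, 0.4150374993),
}


def satisfies_fact_1(uni, uc, cor):
    """Fact 1: 0 < S^UNI <= S^UC <= S^COR for a proper nonlocality proof."""
    return 0 < uni <= uc <= cor


def gain_from_optimizing(uni, uc):
    """Factor by which optimized uncorrelated settings beat uniform settings."""
    return uc / uni


for name, (uni, uc, cor) in STRENGTHS.items():
    print(name, satisfies_fact_1(uni, uc, cor), round(gain_from_optimizing(uni, uc), 3))
```

For CHSH and GHZ the gain factor is 1, while for the other proofs non-uniform sampling of the settings yields a strictly larger strength.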



Example 2 (The CHSH Results): To help interpret the table, we continue our Example 1 on CHSH. The entry in the first
('uniform') column for the CHSH proof was obtained as follows. $\sigma$ was set to the uniform distribution $\sigma^\circ = (1/4, 1/4, 1/4, 1/4)$.
Q was set to the values in the table of quantum CHSH probabilities given earlier, resulting in a joint distribution $Q_{\sigma^\circ}$ on
measurement settings and outcomes. $Q_{\sigma^\circ}$ was used to determine the local theory $\pi^* \in \Pi$ that attains the minimum in

$$\inf_{\pi \in \Pi} D(Q_{\sigma^\circ} \,\|\, P_{\sigma^\circ;\pi}).$$

The resulting $\pi^*$ can be found numerically. The corresponding distributions $P_{ab;\pi^*}$ are given in Table I.






TABLE I

Best Classical Theory for CHSH

$P_{ab;\pi^*}(X = x, Y = y)$           a = 1                 a = 2
                               x = T     x = F       x = T     x = F

b = 1     y = T                0.375     0.125       0.375     0.125
          y = F                0.125     0.375       0.125     0.375

b = 2     y = T                0.375     0.125       0.125     0.375
          y = F                0.125     0.375       0.375     0.125



The KL divergence between $Q_{\sigma^\circ}$ and $P_{\sigma^\circ;\pi^*}$ can now be calculated. It is equal to 0.0462738469, the left-most entry in
Table (11) in the CHSH row. To get the rightmost entry in this row, we performed the same computation for all $\sigma$ (we
will explain later how to do this efficiently). We found that the resulting KL divergence $D(Q_\sigma \,\|\, P_{\sigma;\pi^*})$ (where $\pi^*$ depends on
$\sigma$) was, in fact, maximized for $\sigma = \sigma^\circ$: there was no gain in trying any other value for $\sigma$. Thus, the rightmost column is equal
to the leftmost column. Finally, Fact 1 above implies that the middle column entry must be in between the leftmost and the
rightmost, explaining the entry in the middle column. The corresponding analysis for the other nonlocality proofs is done in
Appendix III.
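The last step of this example, evaluating the KL divergence for uniform settings, can be sketched in a few lines of Python. Two assumptions are made that are not restated in this section: the quantum probabilities are taken to be the standard CHSH values (the two outcomes agree with probability cos²(π/8) at settings (1,1), (1,2), (2,1) and disagree with that probability at setting (2,2)), and logarithms are taken to base 2. The function and variable names are illustrative.

```python
import math

SETTINGS = [(1, 1), (1, 2), (2, 1), (2, 2)]
OUTCOMES = [(x, y) for x in "TF" for y in "TF"]


def table(p_favoured):
    """Outcome distributions per setting, with total mass p_favoured on the
    'favoured' outcomes: equal outcomes, except unequal ones at setting (2, 2)."""
    dist = {}
    for a, b in SETTINGS:
        for x, y in OUTCOMES:
            favoured = (x != y) if (a, b) == (2, 2) else (x == y)
            dist[(a, b, x, y)] = (p_favoured if favoured else 1 - p_favoured) / 2
    return dist


Q = table(math.cos(math.pi / 8) ** 2)  # assumed quantum CHSH probabilities
P = table(0.75)                        # Table I: entries 0.375 and 0.125


def kl_bits(q, p, sigma):
    """D(Q_sigma || P_sigma) in bits: setting-weighted average of per-setting KL."""
    return sum(
        sigma[(a, b)] * q[(a, b, x, y)] * math.log2(q[(a, b, x, y)] / p[(a, b, x, y)])
        for a, b in SETTINGS
        for x, y in OUTCOMES
    )


uniform = {setting: 0.25 for setting in SETTINGS}
strength = kl_bits(Q, P, uniform)
print(f"{strength:.6f}")
```

Under these assumptions the computed value agrees with the CHSH entry of Table (11) to the displayed precision.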

Example 3 (Mermin's "a million runs"): We recall Mermin's quote from the Introduction where he says that "a million
runs" of his experiment should be enough to convince us that "the observed frequencies . . . are not chance fluctuations". We
can now put numbers to this.

Assuming that we perform Mermin's experiment with the optimized, uncorrelated settings, we should get a strength of
1,000,000 × 0.0191506613 ≈ 19,150. This means that after the million runs of the experiment, the likelihood of local realism
still being true is comparable with the likelihood of a coin being fair after 19,150 tosses when the outcome was "tails" all the
time.
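The arithmetic of this example can be written out as follows. The sketch assumes that strength is measured in bits per trial (base-2 logarithms), so that n trials yield about n times the strength in bits, and that k bits of evidence correspond to the probability 2^(-k) of a fair coin showing k tails in a row. The helper name `evidence_bits` is illustrative.

```python
import math


def evidence_bits(strength_per_trial, n_trials):
    """Expected accumulated evidence (in bits) after n_trials independent trials."""
    return strength_per_trial * n_trials


MERMIN_UC = 0.0191506613  # Mermin, uncorrelated settings, from Table (11)

bits = evidence_bits(MERMIN_UC, 1_000_000)
# Probability of `bits` consecutive tails with a fair coin, as a power of ten.
log10_probability = -bits * math.log10(2)

print(round(bits, 1))
print(round(log10_probability))
```

The second printed number is the exponent of ten of the corresponding coin-tossing probability, which makes the size of the effect easier to appreciate than the raw number of tosses.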

From Table (11) we see that in the two-party setting, the nonlocality proof of CHSH is much stronger than those of Bell,
Hardy or Mermin, and that this optimal strength is obtained for uniform measurement settings. Furthermore it is clear that the
three-party proof of GHZ is about four and a half times stronger than all the two-party proofs.

We also note that the nonlocality proof of Mermin, in the case of non-uniform settings, is equally strong as the optimized
version of Bell's proof. The setting distribution tables in Appendix III-E show why this is the case: the optimal setting
distribution for Mermin excludes one setting on A's side, and one setting on B's side, thus reducing Mermin's proof to that of
Bell. One can view this as an example of how a proof that is easier to understand (Mermin) is not necessarily stronger than
one that has more subtle arguments (Bell).

We also see that in general, except for CHSH's proof, uniform setting distributions do not give the optimal strength of
a nonlocality proof. Rather, the experimenter obtains more evidence for the nonlocality of nature by employing sampling
frequencies that are biased towards those settings that are more relevant for the nonlocality proof.

VI. Interpretation and Discussion 

A. Which nonlocality proof is strongest and what does it mean?

1) Caveat: statistical strength is not the whole story: First of all, we stress that statistical strength is by no means the only
factor in determining the 'goodness' of a nonlocality proof and its corresponding experiment. Various other aspects also come 
into play, such as: how easy is it to prepare certain types of particles in certain states? Can we arrange to have the time and 
spatial separations which are necessary to make the results convincing? Can we implement the necessary random changes in 
settings per trial, quickly enough? Our notion of strength neglects all these important practical aspects. 

2) Comparing GHZ and CHSH: GHZ is the clear winner among all proofs that we investigated, being about 4.5 times 
stronger than CHSH, the strongest two-party proof that we found. This means that, to obtain a given level of support for QM 
and against LR, the optimal CHSH experiment has to be repeated about 4.5 times as often as the optimal GHZ experiment. 

On the other hand, the GHZ proof is much harder to prepare experimentally. In light of the reasoning above, and assuming 
that both CHSH and GHZ can be given a convincing experimental implementation, it may be the case that repeating the 
CHSH experiment 4.5 x n times is much cheaper than repeating GHZ n times. 
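The trade-off described here can be made concrete with the strengths from Table (11). The sketch below assumes, as an idealization, that evidence accumulates linearly at the rate given by the statistical strength (in bits per trial); the target of 100 bits and the helper name `trials_needed` are purely illustrative.

```python
import math

S_GHZ = 0.2075187496   # GHZ, uncorrelated settings, Table (11)
S_CHSH = 0.0462738469  # CHSH, uncorrelated settings, Table (11)


def trials_needed(target_bits, strength):
    """Smallest number of trials whose expected evidence reaches target_bits."""
    return math.ceil(target_bits / strength)


ratio = S_GHZ / S_CHSH
print(round(ratio, 2))
print(trials_needed(100, S_GHZ), trials_needed(100, S_CHSH))
```

Whether CHSH or GHZ is preferable in practice then depends on whether a single GHZ trial costs more or less than about `ratio` CHSH trials.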

3) Nonlocality 'without inequality'?: The GHZ proof was the first of a new class of proofs of Bell's theorem, "without
inequalities". It specifies a state and collection of settings, such that all QM probabilities are zero or one, while this is impossible
under LR. The QM probabilities involved are just the probabilities of the four events in Equation (35), Appendix III-D. The
fact that all these must be either 0 or 1 has led some to claim that the corresponding experiment has to be performed only once
in order to rule out local realism.¹ As has been observed before [28], this is not the case. This can be seen immediately if
we let LR adopt the uniform distribution on all possible observations. Then, although QM is correct, no matter how often
the experiment is repeated, the resulting sequence of observations does not have zero probability under LR's local theory,

¹ Quoting [28], "The list of authors [claiming that a single experiment is sufficient to invalidate local realism] is too long to be given explicitly, and it would
be unfair to give only a partial list."






simply because no sequence of observations has probability 0 under LR's theory. We can only decide that LR is wrong on a
statistical basis: the observations are much more likely under QM than under LR. This happens even if, instead of using the
uniform distribution, LR uses the local theory that is closest in KL divergence to the Q induced by the GHZ scenario. The
reason is that there exists a positive $\epsilon$ such that any local realist theory which comes within $\epsilon$ of all the equalities but one
is forced to deviate by more than $\epsilon$ in the last. Thus, accompanying the GHZ style proof without inequalities is an implied
inequality, and it is this latter inequality that can be tested experimentally.

B. Is our definition of statistical strength the right one? 

We can think of two objections against our definition of statistical strength. First, we may wonder whether the KL divergence 
is really the right measure to use. Second, assuming that KL divergence is the right measure, is our game-theoretic setting 
justified? We treat both issues in turn. 

1) Is Kullback-Leibler divergence justified?: We can see two possible objections against KL divergence: (1) different
statistical paradigms such as the 'Bayesian' and 'frequentist' paradigm define 'amount of support' in different manners
(Appendix IV-A); (2) 'asymptopia': KL divergence is an inherently asymptotic notion.

These two objections are inextricably intertwined: there exists no non-asymptotic measure which would (a) be acceptable to 
all statisticians; (b) would not depend on prior considerations, such as a 'prior distribution' for the distributions involved in the 
Bayesian framework, and a pre-set significance level in the frequentist framework. Thus, since we consider it most important 
to arrive at a generally acceptable and objective measure, we decided to opt for the KL divergence. We add here that even 
though this notion is asymptotic, it can be used to provide numerical bounds on the actual, non-asymptotic amount of support 
provided on each trial, both in Bayesian and in frequentist terms. We have not pursued this option any further here. 

2) Game-Theoretic Justification: There remains the question of whether to prefer $S_Q^{\mathrm{UNI}}$, $S_Q^{\mathrm{COR}}$ or, as we do, $S_Q^{\mathrm{UC}}$. The problem
with $S_Q^{\mathrm{UNI}}$ is that, for any given combination of nonlocality proof Q and local theory $\pi$, different settings may provide, on
average, more information about the nonlocality of nature than others. This is evident from Table (11). We see no reason for
the experimenter not to exploit this.

On the other hand, allowing QM to use correlated distributions makes QM's case much weaker: LR might now argue that
there is some hidden communication between the parties. Since QM's goal is to provide an experiment that is as convincing
as possible to LR, we do not allow for this situation. Thus, among the three definitions considered, $S_Q^{\mathrm{UC}}$ seems to be the
most reasonable one. Nevertheless, one may still argue that none of the three definitions of strength is correct: they all seem
unfavourable to QM, since we allow LR to adjust his theory to whatever frequency of measurement settings QM is going
to use. In contrast, our definition does not allow QM to adjust his setting distribution to LR's choice (which would lead to
strength defined as inf sup rather than sup inf). The reason why we favour LR in this way is that the quantum experimenter QM
should try to convince LR that nature is nonlocal in a setting about which LR cannot complain. Thus, if LR wants to entertain
several local theories at the same time, or wants to have a look at the probabilities $\sigma_{ab}$ before the experiment is conducted,
QM should allow him to do so; he will still be able to convince LR, even though he may need to repeat the experiment a
few more times. Nevertheless, in developing clever strategies for computing $S_Q^{\mathrm{UC}}$, it turns out to be useful to investigate the
inf sup scenario in more detail. This is done in Section VII-B.

Summarizing, our approach is highly nonsymmetric between quantum mechanics and local realism. There is only one
quantum theory, and QM believes in it, but he must arm himself against any and all local realists.²

C. Related Work by Peres 

Earlier work in our direction has been done by Peres [28] who adopts a Bayesian type of approach. Peres implicitly uses the 
same definition of strength of nonlocality proofs as we do here, after merging equal probability joint outcomes of the experiment. 
Our work extends his in several ways; most importantly, we allow the experimentalist to optimize her experimental settings, 
whereas Peres assumes particular (usually uniform) distributions over the settings. Peres determines LR's best theory by an 
inspired guess. The proofs he considers have so many symmetries, that the best LR theory has the same equal probability joint 
outcomes as the QM experiment, the reduced experiment is binary, and his guess always gives the right answer. His strategy 
would not work for, e.g., the Hardy proof. 

Peres starts out with a nonlocality proof $Q_\sigma$ to be tested against local theory $P_{\sigma;\pi}$, for some fixed distribution $\sigma$. Peres
then defines the confidence depressing factor for n trials. In fact, Peres rediscovers the notion of KL divergence, since a
straightforward calculation shows that for large n,

$$D(Q_\sigma \,\|\, P_{\sigma;\pi}) = \frac{1}{n} \log(\text{confidence depressing factor}). \qquad (12)$$

² Some readers might wonder what would happen if one were to replace the $D(Q \| P)$ in our analysis by $D(P \| Q)$. In short, $D(P \| Q)$ quantifies how strongly
the predictions of quantum mechanics disagree with the outcomes of a classical system P. Hence such an analysis would be useful if one has to prove that the
statistics of a local realistic experiment (say, a network of classically communicating computers) are not in correspondence with the predictions of quantum
mechanics. The minimax solution of the game based on $D(P \| Q)$ provides a value of Q that QM should specify as part of a challenge to LR to reproduce
quantum predictions with LR's theory. With this challenge, the computer simulation using LR's theory can be run in as short an amount of time as possible,
before giving sufficient evidence that LR has failed.

For any given large n, the larger the confidence depressing factor for n, the more evidence against $P_{\sigma;\pi}$ we are likely to get
on the basis of n trials. Thus, when comparing a fixed quantum experiment (with fixed $\sigma$) $Q_\sigma$ to a fixed local theory $P_{\sigma;\pi}$,
Peres' notion of strength is equivalent to ours. Peres then goes on to say that, when comparing a fixed quantum experiment
$Q_\sigma$ to the corresponding set of all local theories $\mathcal{P}_\sigma$, we may expect that LR will choose the local theory with the least
confidence depressing factor, i.e. the smallest KL divergence to $Q_\sigma$. Thus, whenever Peres chooses uniform $\sigma$, his notion
of strength corresponds to our $S_Q^{\mathrm{UNI}}$, represented in the first column of Table (11). In practice, Peres chooses an intuitive $\sigma$,
which is usually, but not always, uniform in our sense. For example, in the GHZ scenario, Peres implicitly assumes that
only those measurement settings are used that correspond to the probabilities (all 0 or 1) appearing in the GHZ-inequality
(35), Appendix III-D. Thus, his experiment corresponds to a uniform distribution on those four settings. Interestingly, such a
distribution on settings is not allowed under our definition of strength $S_Q^{\mathrm{UC}}$, since it makes the probability of the setting at party
A dependent on (correlated with) the other settings. This explains why Peres obtains a larger strength for GHZ than we do:
he obtains $\log 0.75^{-1} = 0.4150\ldots$, which corresponds to our $S_Q^{\mathrm{COR}}$: the uniform distribution on the restricted set of settings
appearing in the GHZ proof turns out to be the optimum over all distributions on measurement settings.
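The numerical agreement between Peres' value and the GHZ row of Table (11) is easy to verify. The sketch below assumes base-2 logarithms, which is the choice under which the quoted number 0.4150... is reproduced; the variable names are illustrative.

```python
import math

# Peres' per-trial value for GHZ: the logarithm of 1/0.75, taken to base 2.
peres_ghz = math.log2(1 / 0.75)

S_GHZ_COR = 0.4150374993  # GHZ, correlated settings, Table (11)
S_GHZ_UC = 0.2075187496   # GHZ, uncorrelated settings, Table (11)

print(round(peres_ghz, 10))
print(round(peres_ghz / S_GHZ_UC, 6))
```

The second line shows that, for GHZ, allowing correlated settings doubles the strength relative to uncorrelated settings.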

Our approach may be viewed as an extension of Peres' in several ways. First, we relate his confidence depressing factor
to the Kullback-Leibler divergence and show that this is the right measure to use not just from a Bayesian point of view,
but also from an information-theoretic point of view and the standard, 'orthodox' frequentist statistics point of view. Second,
we extend his analysis to non-uniform distributions $\sigma$ over measurement settings and show that in some cases, substantial
statistical strength can be gained if QM uses non-uniform sampling distributions. Third, we give a game-theoretic treatment of
the maximization over $\sigma$ and develop the necessary mathematical tools to enable fast computations of statistical strength. Fourth,
whereas he finds the best LR theory by cleverly guessing, we show the search for this theory can be performed automatically.

D. Future Extensions and Conjectures 

The purpose of our paper has been to objectively compare the statistical strength of existing proofs of Bell's theorem. The 
tools we have developed, can be used in many further ways. 

Firstly, one can take a given quantum state, and ask the question, what is the best experiment which can be done with it. 
This leads to a measure of statistical nonlocality of a given joint state, whereby one is optimizing (in the outer optimization) 
not just over setting distributions, but also over the settings themselves, and even over the number of settings. 

Secondly, one can take a given experimental type, for instance: the 2x2x2 type, and ask what is the best state, settings, and 
setting distribution for that type of experiment? This comes down to replacing the outer optimization over setting distributions, 
with an optimization over states, settings, and setting distribution. 

Using numerical optimization, we were able to analyse a number of situations, leading to the following conjectures. 

Conjecture 1: Among all 2 x 2 x 2 proofs, and allowing correlated setting distributions, CHSH is best. 

Conjecture 2: Among all 3 x 2 x 2 proofs, and allowing correlated setting distributions, GHZ is best. 

Conjecture 3: The best experiment with the Bell singlet state is the CHSH experiment. 

In [1] Acín et al. investigated the natural generalization of CHSH type experiments to qutrits. Their main interest was the
resistance of a given experiment to noise, and to their surprise they discovered in the 2 × 2 × 3 case that a less entangled state
was more resistant to noise than the maximally entangled state. After some preliminary investigations, we found that a
similar experiment with an even less entangled state gives a stronger nonlocality experiment.

Conjecture 4: The strongest possible 2 × 2 × 3 nonlocality proof has statistical strength 0.077, and it uses the bipartite state
$\approx 0.6475\,|1,1\rangle + 0.6475\,|2,2\rangle + 0.4019\,|3,3\rangle$.

If true, this conjecture is in remarkable contrast with what appears to be the strongest possible 2 × 2 × 3 nonlocality proof that
uses the maximally entangled state $(|1,1\rangle + |2,2\rangle + |3,3\rangle)/\sqrt{3}$, which has a statistical strength of only 0.058.

Conjecture 4 suggests that it is not always the case that a quantum state with more 'entropy of entanglement' [6] will
give a stronger nonlocality proof. Rather, it seems that entanglement and statistical nonlocality are different quantities. One
possibility however is that the counterintuitive results just mentioned would disappear if one could do joint measurements on
several pairs of entangled qubits, qutrits, or whatever. A regularized measure of nonlocality of a given state would be the limit
for $k \to \infty$ of the strength of the best experiment based on k copies of the state (where the parties are allowed to make joint
measurements on k systems at the same time), divided by k. One may conjecture for instance that the best experiment based
on two copies of the Bell singlet state is more than twice as good as the best experiment based on single states. That would
be a form of "superadditivity of nonlocality", quite in line with other forms of superadditivity which are known to follow from
entanglement.

Conjecture 5: There is an experiment on pairs of Bell singlets, of the 2x4x4 type, more than twice as strong as CHSH, 
and involving joint measurements on the pairs. 

VII. Mathematical and Computational Properties of Statistical Strength 

Having presented and discussed the strengths of various nonlocality proofs, we now turn to the second, more mathematical
part of the paper. We first prove several mathematical properties of our three variations of statistical strength. Some of these
are interesting in their own right, giving new insights into the relation between distributions predicted by quantum theory and
local realist approximations of it. But their main purpose is to help us compute $S_Q^{\mathrm{UC}}$.

A. Basic Properties 

We proceed to list some essential properties of $S_Q^{\mathrm{UNI}}$, $S_Q^{\mathrm{UC}}$ and $S_Q^{\mathrm{COR}}$. We say that "nonlocality proof Q is absolutely continuous
with respect to local realist theory $\pi$" [12] if and only if for all $a,b \in \{1,2\}$, $x,y \in \{T,F\}$, it holds that if $Q_{ab}(x,y) > 0$ then
$P_{ab;\pi}(x,y) > 0$.

Theorem 1: Let Q be a given (not necessarily 2 × 2 × 2) nonlocality proof and $\Pi$ the corresponding set of local realist
theories.

1) Let $U(\sigma,\pi) := D(Q_\sigma \,\|\, P_{\sigma;\pi})$, then:

a) For a 2 × 2 × 2 proof, we have that

$$U(\sigma,\pi) = \sum_{a,b \in \{1,2\}} \sigma_{ab}\, D(Q_{ab}(\cdot) \,\|\, P_{ab;\pi}(\cdot)). \qquad (13)$$

Hence, the KL divergence $D(Q_\sigma \,\|\, P_{\sigma;\pi})$ may alternatively be viewed as the average KL divergence between the
conditional distributions of (X,Y) given the setting (A,B), where the average is over the setting. For a generalized
nonlocality proof, the analogous generalization of Equation (13) holds.

b) For fixed $\sigma$, $U(\sigma,\pi)$ is convex and lower semicontinuous on $\Pi$, and continuous and differentiable on the interior
of $\Pi$.

c) If Q is absolutely continuous with respect to some fixed $\pi$, then $U(\sigma,\pi)$ is linear in $\sigma$.

2) Let

$$U(\sigma) := \inf_{\pi \in \Pi} U(\sigma,\pi), \qquad (14)$$

then

a) For all $\sigma \in \Sigma$, the infimum in Equation (14) is achieved for some $\pi^*$.

b) The function $U(\sigma)$ is nonnegative, bounded, concave and continuous in $\sigma$.

c) If Q is not a proper nonlocality proof, then for all $\sigma \in \Sigma$, $U(\sigma) = 0$. If Q is a proper nonlocality proof, then $U(\sigma) > 0$
for all $\sigma$ in the interior of $\Sigma$.

d) For a 2 party, 2 measurement settings per party nonlocality proof, we further have that, even if Q is proper, then
still $U(\sigma) = 0$ for all $\sigma$ on the boundary of $\Sigma$.

3) Suppose that $\sigma$ is in the interior of $\Sigma$, then:

a) Let Q be a 2 × 2 × 2 nonlocality proof. Suppose that Q is non-trivial in the sense that, for some a,b, $Q_{ab}$ is not a
point mass (i.e. $0 < Q_{ab}(x,y) < 1$ for some x,y). Then $\pi^* \in \Pi$ achieves the infimum in Equation (14) if and only
if the following 16 (in)equalities hold:

$$\sum_{a,b \in \{1,2\}} \sigma_{ab} \frac{Q_{ab}(x_a,y_b)}{P_{ab;\pi^*}(x_a,y_b)} = 1 \qquad (15)$$

for all $(x_1,x_2,y_1,y_2) \in \{T,F\}^4$ such that $\pi^*_{x_1 x_2 y_1 y_2} > 0$, and

$$\sum_{a,b \in \{1,2\}} \sigma_{ab} \frac{Q_{ab}(x_a,y_b)}{P_{ab;\pi^*}(x_a,y_b)} \le 1 \qquad (16)$$

for all $(x_1,x_2,y_1,y_2) \in \{T,F\}^4$ such that $\pi^*_{x_1 x_2 y_1 y_2} = 0$.

For generalized nonlocality proofs, $\pi^* \in \Pi$ achieves the infimum in Equation (14) if and only if the corresponding analogues of
Equations (15) and (16) both hold.

b) Suppose that $\pi^*$ and $\pi^\circ$ both achieve the infimum in Equation (14). Then for all $x,y \in \{T,F\}$, $a,b \in \{1,2\}$ with
$Q_{ab}(x,y) > 0$, we have $P_{ab;\pi^*}(x,y) = P_{ab;\pi^\circ}(x,y) > 0$. In words, $\pi^*$ and $\pi^\circ$ coincide in every measurement setting
for every measurement outcome that has positive probability according to $Q_\sigma$, and Q is absolutely continuous with
respect to $\pi^*$ and $\pi^\circ$.

The proof of this theorem is in Appendix V-B.

In general, $\inf_\Pi U(\sigma,\pi)$ may be achieved for several, different $\pi$. By part 2 of the theorem, these must induce the same four
marginal distributions $P_{ab;\pi}$. It also follows directly from part 2 of the theorem that, for 2 × 2 × 2 proofs, $S_Q^{\mathrm{UC}} := \sup_{\Sigma^{\mathrm{UC}}} U(\sigma)$
is achieved for some $\sigma^* \in \Sigma^{\mathrm{UC}}$, where $\sigma^*_{ab} > 0$ for all $a,b \in \{1,2\}$.

B. Game-Theoretic Considerations

The following considerations will enable us to compute $S_Q^{\mathrm{UC}}$ very efficiently in some special cases, most notably the CHSH
proof.

We consider the following variation of our basic scenario. Suppose that, before the experiments are actually conducted, LR
has to decide on a single local theory $\pi_0$ (rather than the set $\Pi$) as an explanation of the outcomes that will be observed. QM
then gets to see this $\pi_0$, and can choose $\sigma$ depending on the $\pi_0$ that has been chosen. Since QM wants to maximize the strength
of the experiment, he will pick the $\sigma$ achieving $\sup_{\Sigma^{\mathrm{UC}}} D(Q_\sigma \,\|\, P_{\sigma;\pi_0})$. In such a scenario, the 'best' LR theory, minimizing
statistical strength, is the LR theory $\pi_0$ that minimizes, over $\pi \in \Pi$, $\sup_{\Sigma^{\mathrm{UC}}} D(Q_\sigma \,\|\, P_{\sigma;\pi})$. Thus, in this slightly different setup,
the statistical strength is determined by

$$\overline{S}_Q^{\mathrm{UC}} := \inf_{\Pi} \sup_{\Sigma^{\mathrm{UC}}} D(Q_\sigma \,\|\, P_{\sigma;\pi}) \qquad (17)$$

rather than $S_Q^{\mathrm{UC}} := \sup_{\Sigma^{\mathrm{UC}}} \inf_{\Pi} D(Q_\sigma \,\|\, P_{\sigma;\pi})$. Below we show that $\overline{S}_Q^{\mathrm{UC}} \ge S_Q^{\mathrm{UC}}$. As we already argued in Section VI, we consider
the definition $S_Q^{\mathrm{UC}}$ to be preferable over $\overline{S}_Q^{\mathrm{UC}}$. Nevertheless, it is useful to investigate under what conditions $S_Q^{\mathrm{UC}} = \overline{S}_Q^{\mathrm{UC}}$. Von
Neumann's famous minimax theorem of game theory [26] suggests that

$$\sup_{\Sigma^*} \inf_{\Pi} D(Q_\sigma \,\|\, P_{\sigma;\pi}) = \inf_{\Pi} \sup_{\Sigma^*} D(Q_\sigma \,\|\, P_{\sigma;\pi}), \qquad (18)$$

if $\Sigma^*$ is a convex subset of $\Sigma$. Indeed, Theorem 2 below shows that Equation (18) holds if we take $\Sigma^* = \Sigma$. Unfortunately, $\Sigma^{\mathrm{UC}}$
is not convex, and Equation (18) does not hold in general for $\Sigma^* = \Sigma^{\mathrm{UC}}$, whence in general $S_Q^{\mathrm{UC}} \ne \overline{S}_Q^{\mathrm{UC}}$. Nevertheless, Theorem 3
provides some conditions under which Equation (18) does hold with $\Sigma^* = \Sigma^{\mathrm{UC}}$. In Section VII-C we put this fact to use in
computing $S_Q^{\mathrm{UC}}$ for the CHSH nonlocality proof. But before presenting Theorems 2 and 3 we first need to introduce some
game-theoretic terminology.
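
1) Game-Theoretic Definitions:

Definition 4 (Statistical Game [13]): A statistical game is a triplet (A,B,L) where A and B are arbitrary sets and
$L : A \times B \to \mathbb{R} \cup \{-\infty, \infty\}$ is a loss function. If

$$\sup_{\alpha \in A} \inf_{\beta \in B} L(\alpha,\beta) = \inf_{\beta \in B} \sup_{\alpha \in A} L(\alpha,\beta), \qquad (19)$$

we say that the game has value V with

$$V := \sup_{\alpha \in A} \inf_{\beta \in B} L(\alpha,\beta). \qquad (20)$$

If for some $(\alpha^*,\beta^*) \in A \times B$ we have

For all $\alpha \in A$: $L(\alpha,\beta^*) \le L(\alpha^*,\beta^*)$
For all $\beta \in B$: $L(\alpha^*,\beta) \ge L(\alpha^*,\beta^*)$

then we call $(\alpha^*,\beta^*)$ a saddle point of the game. It is easily seen (Proposition 1, Appendix V) that, if $\alpha^\circ$ achieves
$\sup_{\alpha \in A} \inf_{\beta \in B} L(\alpha,\beta)$ and $\beta^\circ$ achieves $\inf_{\beta \in B} \sup_{\alpha \in A} L(\alpha,\beta)$ and the game has value V, then $(\alpha^\circ,\beta^\circ)$ is a saddle point and
$L(\alpha^\circ,\beta^\circ) = V$.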
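For finite sets A and B the notions of Definition 4 can be checked by brute force. The following Python sketch is only an illustration for loss functions given as small matrices (rows indexed by the maximizing player's choices, columns by the minimizing player's choices); the function names and the two example matrices are hypothetical.

```python
def maximin(loss):
    """sup over rows (alpha) of the inf over columns (beta)."""
    return max(min(row) for row in loss)


def minimax(loss):
    """inf over columns (beta) of the sup over rows (alpha)."""
    n_cols = len(loss[0])
    return min(max(row[j] for row in loss) for j in range(n_cols))


def saddle_points(loss):
    """All (i, j) with loss[a][j] <= loss[i][j] <= loss[i][b] for all a, b."""
    points = []
    for i, row in enumerate(loss):
        for j, value in enumerate(row):
            column = [r[j] for r in loss]
            if value == max(column) and value == min(row):
                points.append((i, j))
    return points


with_value = [[3, 5], [2, 4]]       # maximin equals minimax: the game has a value
without_value = [[1, -1], [-1, 1]]  # matching pennies: no value in pure strategies

print(maximin(with_value), minimax(with_value), saddle_points(with_value))
print(maximin(without_value), minimax(without_value), saddle_points(without_value))
```

In the first game the two quantities in Equation (19) coincide and a saddle point exists; in the second they differ, so the game has no value when restricted to these finite sets.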

Definition 5 (Correlated Game): With each nonlocality proof we associate a corresponding correlated game, which is just
the statistical game defined by the triple $(\Sigma, \Pi, U)$, where $U : \Sigma \times \Pi \to \mathbb{R} \cup \{\infty\}$ is defined by

$$U(\sigma,\pi) := D(Q_\sigma \,\|\, P_{\sigma;\pi}). \qquad (21)$$

By the definition above, if this game has a value then it is equal to V defined by

$$V := \inf_{\Pi} \sup_{\Sigma} U(\sigma,\pi) = \sup_{\Sigma} \inf_{\Pi} U(\sigma,\pi). \qquad (22)$$

We call the game correlated because we allow distributions $\sigma$ over measurement settings to be such that the probability that
party A is in setting a is correlated with (is dependent on) the setting b of party B. The fact that each correlated game has a
well defined value is made specific in Theorem 2 below.

Definition 6 (Uncorrelated Game): Recall that we use $\Sigma^{\mathrm{UC}}$ to denote the set of vectors representing uncorrelated distributions
in $\Sigma$. With each nonlocality proof we can associate the game $(\Sigma^{\mathrm{UC}}, \Pi, U)$ which we call the corresponding uncorrelated game.

2) Game-Theoretic, Saddle Point Theorems:

Theorem 2 (Saddle point for Potentially Correlated Settings): For every (generalized) nonlocality proof, the correlated game
$(\Sigma, \Pi, U)$ corresponding to it has a finite value, i.e. there exists a $0 \le V < \infty$ with $\inf_\Pi \sup_\Sigma U(\sigma,\pi) = V = \sup_\Sigma \inf_\Pi U(\sigma,\pi)$.
The infimum on the left is achieved for some $\pi^* \in \Pi$; the supremum on the right is achieved for some $\sigma^*$ in $\Sigma$, so that $(\pi^*,\sigma^*)$
is a saddle point.

The proof of this theorem is in Appendix V-C.2.

In the information-theoretic literature, several well-known minimax and saddle point theorems involving the Kullback-Leibler 
divergence exist; we mention [21], [33]. However, all these deal with settings that are substantially different from ours. 

In the case where there are two parties and two measurement settings per party, we can say a lot more. 

Theorem 3 (Saddle point for 2 × 2 × N Nonlocality Proofs): Fix any proper nonlocality proof based on 2 parties with 2
measurement settings per party and let $(\Sigma, \Pi, U)$ and $(\Sigma^{\mathrm{UC}}, \Pi, U)$ be the corresponding correlated and uncorrelated games,
then:

1) The correlated game has a saddle point with value V > 0. Moreover,

$$\sup_{\Sigma^{\mathrm{UC}}} \inf_{\Pi} U(\sigma,\pi) \le \sup_{\Sigma} \inf_{\Pi} U(\sigma,\pi) = V, \qquad (23)$$

$$\inf_{\Pi} \sup_{\Sigma^{\mathrm{UC}}} U(\sigma,\pi) = \inf_{\Pi} \sup_{\Sigma} U(\sigma,\pi) = V. \qquad (24)$$

2) Let

$$\Pi^* := \{\pi : \pi \text{ achieves } \inf_{\Pi} \sup_{\Sigma} U(\sigma,\pi)\}, \qquad (25)$$

$$\Pi^{\mathrm{UC}*} := \{\pi : \pi \text{ achieves } \inf_{\Pi} \sup_{\Sigma^{\mathrm{UC}}} U(\sigma,\pi)\}, \qquad (26)$$

then

a) $\Pi^*$ is non-empty.

b) $\Pi^* = \Pi^{\mathrm{UC}*}$.

c) All $\pi^* \in \Pi^*$ are 'equalizer strategies', i.e. for all $\sigma \in \Sigma$ we have the equality $U(\sigma,\pi^*) = V$.

3) The uncorrelated game has a saddle point if and only if there exists $(\pi^*,\sigma^*)$, with $\sigma^* \in \Sigma^{\mathrm{UC}}$, such that

a) $\pi^*$ achieves $\inf_{\Pi} U(\sigma^*,\pi)$.

b) $\pi^*$ is an equalizer strategy.

If such $(\sigma^*,\pi^*)$ exists, it is a saddle point.

The proof of this theorem is in Appendix V-C.3.



C. Computing Statistical Strength 

We are now armed with the mathematical tools needed to compute statistical strength. By convexity of $U(\sigma,\pi)$ in $\pi$, we see
that for fixed $\sigma$, determining $D(Q_\sigma \,\|\, \mathcal{P}_\sigma) = \inf_\Pi U(\sigma,\pi)$ is a convex optimization problem, which suggests that numerical
optimization is computationally feasible. Interestingly, it turns out that computing $\inf_\Pi U(\sigma,\pi)$ is formally equivalent to
computing the maximum likelihood in a well-known statistical missing data problem. Indeed, we obtained our results by
using a 'vertex direction algorithm' [16], a clever numerical optimization algorithm specifically designed for statistical missing
data problems.

By concavity of $U(\sigma)$ as defined in Theorem 1, we see that determining $S_Q^{\mathrm{COR}}$ is a concave optimization problem. Thus,
numerical optimization can again be performed. There are some difficulties in computing the measure $S_Q^{\mathrm{UC}}$, since the set $\Sigma^{\mathrm{UC}}$
over which we maximize is not convex. Nevertheless, for the small problems (few parties, particles, measurement settings) we
consider here it can be done.
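To illustrate the inner minimization as a missing data problem, the sketch below computes $\inf_\Pi U(\sigma,\pi)$ for CHSH with uniform settings. It does not implement the vertex direction algorithm used for the results in this paper; it swaps in a plain EM iteration for mixture weights, where the hidden variable is the deterministic local strategy $(x_1,x_2,y_1,y_2)$. The quantum probabilities (standard CHSH values with agreement probability cos²(π/8)), the base-2 logarithm, the iteration count and all names are assumptions of this illustration.

```python
import itertools
import math

SETTINGS = [(1, 1), (1, 2), (2, 1), (2, 2)]
OUTCOMES = [(x, y) for x in "TF" for y in "TF"]
# All 16 deterministic local strategies (x1, x2, y1, y2).
STRATEGIES = list(itertools.product("TF", repeat=4))


def chsh_quantum():
    """Assumed CHSH quantum probabilities Q_ab(x, y)."""
    p = math.cos(math.pi / 8) ** 2
    q = {}
    for a, b in SETTINGS:
        for x, y in OUTCOMES:
            favoured = (x != y) if (a, b) == (2, 2) else (x == y)
            q[(a, b, x, y)] = (p if favoured else 1 - p) / 2
    return q


def outcome(strategy, a, b):
    """Outcome pair (x_a, y_b) produced by a deterministic strategy at setting (a, b)."""
    x1, x2, y1, y2 = strategy
    return (x1, x2)[a - 1], (y1, y2)[b - 1]


def local_table(pi):
    """Conditional distributions P_{ab;pi}(x, y) induced by the mixture weights pi."""
    p = {(a, b, x, y): 0.0 for a, b in SETTINGS for x, y in OUTCOMES}
    for weight, strategy in zip(pi, STRATEGIES):
        for a, b in SETTINGS:
            p[(a, b) + outcome(strategy, a, b)] += weight
    return p


def kl_bits(q, p, sigma):
    """U(sigma, pi) = D(Q_sigma || P_{sigma;pi}) in bits, as in Equation (13)."""
    return sum(
        sigma[(a, b)] * q[(a, b, x, y)] * math.log2(q[(a, b, x, y)] / p[(a, b, x, y)])
        for a, b in SETTINGS
        for x, y in OUTCOMES
    )


def em_step(pi, q, sigma):
    """One EM update: each weight is multiplied by the left-hand side of (15)/(16)."""
    p = local_table(pi)
    updated = []
    for weight, strategy in zip(pi, STRATEGIES):
        ratio = sum(
            sigma[(a, b)]
            * q[(a, b) + outcome(strategy, a, b)]
            / p[(a, b) + outcome(strategy, a, b)]
            for a, b in SETTINGS
        )
        updated.append(weight * ratio)
    return updated


def closest_local_theory(q, sigma, iterations=300):
    """Approximate minimizer of U(sigma, pi), starting from uniform weights."""
    pi = [1 / len(STRATEGIES)] * len(STRATEGIES)
    for _ in range(iterations):
        pi = em_step(pi, q, sigma)
    return pi


q = chsh_quantum()
uniform = {setting: 0.25 for setting in SETTINGS}
pi_star = closest_local_theory(q, uniform)
strength = kl_bits(q, local_table(pi_star), uniform)
print(f"{strength:.6f}")
```

The update rule makes the link with Theorem 1, part 3(a) visible: a weight stops changing exactly when its multiplier equals 1, and weights whose multiplier stays below 1 shrink towards zero. EM is chosen here only for brevity; it can be slow on larger problems.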

In some special cases, including CHSH, we can do all the calculations by hand and do not have to resort to numerical
optimization. We do this by making an educated guess of the $\sigma^*$ achieving $\sup_{\Sigma^{\mathrm{UC}}} D(Q_\sigma \,\|\, \mathcal{P}_\sigma)$, and then verify our guess using
Theorem 1 and the game-theoretic tools developed in Theorem 3. This can best be illustrated using CHSH as an example.

Example 4 (CHSH, continued): Consider the CHSH nonlocality argument. The quantum distributions Q, given in the table
in Section II, have traditionally been compared with the local theory $\tilde\pi$ defined by

$$\tilde\pi_{FFFF} = \tilde\pi_{TTTT} = \tilde\pi_{FFFT} = \tilde\pi_{TTTF} = \tilde\pi_{FTFF} = \tilde\pi_{TFTT} = \tilde\pi_{TFFT} = \tilde\pi_{FTTF} = \frac{1}{8}, \qquad (27)$$

and $\tilde\pi_{x_1 x_2 y_1 y_2} = 0$ otherwise. This gives rise to the following probability table:

$P_{ab;\tilde\pi}$       $X_1 = T$   $X_1 = F$     $X_2 = T$   $X_2 = F$

$Y_1 = T$         0.375       0.125         0.375       0.125
$Y_1 = F$         0.125       0.375         0.125       0.375

$Y_2 = T$         0.375       0.125         0.125       0.375
$Y_2 = F$         0.125       0.375         0.375       0.125






There exists no local theory which has uniformly smaller absolute deviations from the quantum probabilities in all four tables.
Even though, in general, absolute deviations are not a good indicator of statistical strength, based on the fact that all four tables
'look the same', we may still guess that, in this particular case, for uniform measurement settings $\tilde\sigma_{ab} = 1/4$, $a,b \in \{1,2\}$, the
optimal local realist theory is given by the $\tilde\pi$ defined above. We can now use Theorem 1, part 3(a) to check our guess. Checking
the 16 (in)equalities (15) and (16) shows that our guess was correct: $\tilde\pi$ achieves $\inf_\Pi U(\tilde\sigma,\pi)$ for the uniform measurement settings
$\tilde\sigma$. It is clear that $\tilde\pi$ is an equalizer strategy and that $\tilde\sigma$ is uncorrelated. But now Theorem 3, part 3 tells us that $(\tilde\sigma,\tilde\pi)$ is a
saddle point in the uncorrelated game. This shows that $\tilde\sigma$ achieves $\sup_{\Sigma^{\mathrm{UC}}} \inf_\Pi D(Q_\sigma \,\|\, P_{\sigma;\pi})$. Therefore, the statistical strength of
the CHSH nonlocality proof must be given by

$$S_Q^{\mathrm{UC}} = \sup_{\Sigma^{\mathrm{UC}}} \inf_{\Pi} D(Q_\sigma \,\|\, P_{\sigma;\pi}) = D(Q_{\tilde\sigma} \,\|\, P_{\tilde\sigma;\tilde\pi}), \qquad (29)$$

which is straightforward to evaluate.
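The check of the 16 (in)equalities can be sketched numerically. The Python code below evaluates the left-hand side of (15) and (16) for every deterministic strategy $(x_1,x_2,y_1,y_2)$ under uniform settings, using the probability table of the guessed local theory (entries 0.375 and 0.125). The quantum probabilities are an assumption of this illustration (standard CHSH values with agreement probability cos²(π/8) at settings (1,1), (1,2), (2,1) and disagreement probability cos²(π/8) at (2,2)); all names are illustrative.

```python
import itertools
import math

SETTINGS = [(1, 1), (1, 2), (2, 1), (2, 2)]
P_QUANTUM = math.cos(math.pi / 8) ** 2  # assumed quantum probability of a favoured outcome


def favoured(a, b, x, y):
    """Outcomes favoured by both theories: equal, except unequal at setting (2, 2)."""
    return (x != y) if (a, b) == (2, 2) else (x == y)


def q_prob(a, b, x, y):
    return (P_QUANTUM if favoured(a, b, x, y) else 1 - P_QUANTUM) / 2


def p_prob(a, b, x, y):
    """Probability table of the guessed local theory."""
    return 0.375 if favoured(a, b, x, y) else 0.125


def condition_sum(strategy, sigma_ab=0.25):
    """Left-hand side of (15)/(16) for one strategy (x1, x2, y1, y2)."""
    x1, x2, y1, y2 = strategy
    xs, ys = (x1, x2), (y1, y2)
    return sum(
        sigma_ab * q_prob(a, b, xs[a - 1], ys[b - 1]) / p_prob(a, b, xs[a - 1], ys[b - 1])
        for a, b in SETTINGS
    )


sums = {"".join(s): condition_sum(s) for s in itertools.product("TF", repeat=4)}
tight = sorted(name for name, value in sums.items() if abs(value - 1) < 1e-12)
slack = sorted(name for name, value in sums.items() if value < 1 - 1e-12)

print("sum equals 1:", tight)
print("sum below 1: ", slack)
```

Under these assumptions the sum equals 1 for eight strategies and is strictly smaller than 1 for the remaining eight, which is the pattern required by (15) and (16) when the local theory puts weight 1/8 on each of the first eight strategies and weight 0 on the others.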
Appendix I

Beyond 2x2x2: General Case of Nonlocality Proofs 

Here we extend the 2 × 2 × 2 setting to more than two parties, settings and outcomes. A general nonlocality proof is defined
as a tuple (k,S,X,Q) where

1) k is the number of parties,

2) $S := S_1 \times \cdots \times S_k$ is the set of possible measurement settings.

a) $S_j := \{1,2,\ldots,N_j\}$ is the set of measurement settings for party j.

b) $N_j$ is the number of settings of party j.

3) $X := X_1 \times \cdots \times X_k$ is the set of possible measurement outcomes.

a) $X_j := X_{(j,1)} \times \cdots \times X_{(j,N_j)}$ is the set of measurement outcomes for party j.

b) $X_{(j,s)} := \{1,2,\ldots,N_{(j,s)}\}$ is the set of measurement outcomes for party j when party j chooses setting s.

c) $N_{(j,s)}$ is the number of measurement outcomes for party j when party j chooses setting s.

d) $(X_1,\ldots,X_k)$ are the random variables indicating the outcome at parties 1,2,...,k.

4) $Q := (Q_{s_1 \ldots s_k} : (s_1,\ldots,s_k) \in S)$ is a list of all the distributions $Q_{s_1 \ldots s_k}(X_1 = \cdot, \ldots, X_k = \cdot)$, one for each joint measurement
setting $(s_1,\ldots,s_k) \in S$. These are the distributions on outcomes induced by the state that the quantum experimenter's
entangled states are in.

To each nonlocality proof (k,S,X,Q) there corresponds a set of local realist distributions $\Pi$. Each such distribution is identified
with its probability vector $\pi$. Formally, $\pi$ is a distribution for the tuple of random variables

$$\left(X_{(1,1)}, \ldots, X_{(1,N_1)}, \; \ldots, \; X_{(k,1)}, \ldots, X_{(k,N_k)}\right). \qquad (30)$$

Here $X_{(j,s)}$ denotes LR's distribution of $X_j$ when party j's measurement device is in setting s.

Once again, we call a nonlocality proof proper if and only if it violates local realism, i.e. if there exists no local realist
distribution $\pi$ such that $P_{s_1 \ldots s_k;\pi}(\cdot) = Q_{s_1 \ldots s_k}(\cdot)$ for all $(s_1,\ldots,s_k) \in S$.

The definition of statistical strength remains unchanged. 
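
To make the sample space of a local realist distribution concrete, the following minimal Python sketch enumerates all joint assignments of one outcome to every (party, setting) pair, i.e. the values the tuple in (30) can take. The function name `joint_assignments` and the nested-list encoding of the outcome counts are illustrative assumptions, not notation from the paper.

```python
from itertools import product

def joint_assignments(outcome_counts):
    """Enumerate all values of the tuple (X_(1,1), ..., X_(k,N_k)).

    outcome_counts[j][s] is the number of outcomes of party j under setting s
    (0-based list indices; outcomes themselves are labelled 1, 2, ...).
    """
    ranges = [range(1, n + 1) for party in outcome_counts for n in party]
    return product(*ranges)

# The 2x2x2 case: two parties, two settings each, two outcomes per setting.
assignments_222 = list(joint_assignments([[2, 2], [2, 2]]))
num_222 = len(assignments_222)

# Two parties with three binary settings each (the shape of Mermin's proof).
num_232 = sum(1 for _ in joint_assignments([[2, 2, 2], [2, 2, 2]]))

# Three parties with two binary settings each (the shape of the GHZ proof).
num_322 = sum(1 for _ in joint_assignments([[2, 2], [2, 2], [2, 2]]))
print(num_222, num_232, num_322)
```

A local realist distribution is then a probability vector over these assignments, so its dimension grows as the product of all the outcome counts.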

Appendix II 
The Nonlocality Arguments 

In this Appendix we present the inequalities and logical constraints that must hold under local realism yet can be violated
under quantum mechanics. The specific quantum states chosen to violate these inequalities, as well as the closest possible (in
the KL divergence sense) local theories, are listed in Appendix III.

A. Arguments of Bell and CHSH

CHSH's argument was described in Example 2. By exactly the same line of reasoning as used in obtaining the CHSH
inequality, one also obtains Bell's inequality

$$\Pr(X_1 = Y_1) \le \Pr(X_2 \neq Y_2) + \Pr(X_2 \neq Y_1) + \Pr(X_1 \neq Y_2). \qquad (31)$$

See Sections III-A and III-B for how this inequality can be violated.



B. Hardy's Argument 

Hardy noted the following: if $(X_2 \& Y_2)$ is true, and $(X_2 \Rightarrow Y_1)$ is true, and $(Y_2 \Rightarrow X_1)$ is true, then
$(X_1 \& Y_1)$ is true. Thus $(X_2 \& Y_2)$ implies: $\neg(X_2 \Rightarrow Y_1)$ or $\neg(Y_2 \Rightarrow X_1)$ or $(X_1 \& Y_1)$. Therefore

$$\Pr(X_2 \& Y_2) \le \Pr(X_2 \& \neg Y_1) + \Pr(\neg X_1 \& Y_2) + \Pr(X_1 \& Y_1). \qquad (32)$$

On the other hand, according to quantum mechanics it is possible that the first probability is positive, in particular, equals
0.09, while the three other probabilities here are all zero. See Section III-D for the precise probabilities.
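
The logical step behind (32) can be checked by brute force. The sketch below (an illustration, with the helper name `hardy_holds` chosen here) verifies that every deterministic assignment of truth values to $X_1, X_2, Y_1, Y_2$ satisfies the inequality pointwise, and then compares the two sides using the quantum values stated above (0.09016994375 on the left, as listed in the Hardy table of Appendix III, and zero for each term on the right).

```python
from itertools import product

def hardy_holds(x1, x2, y1, y2):
    """Pointwise version of inequality (32) for one deterministic assignment."""
    lhs = int(x2 and y2)
    rhs = int(x2 and not y1) + int((not x1) and y2) + int(x1 and y1)
    return lhs <= rhs

# Every deterministic local assignment obeys the inequality, hence so does
# every mixture of them, i.e. every local realist distribution.
local_ok = all(hardy_holds(*v) for v in product([False, True], repeat=4))

# Quantum predictions: Pr(X2 & Y2) > 0 while the three right-hand terms vanish.
quantum_lhs = 0.09016994375
quantum_rhs = 0.0 + 0.0 + 0.0
violated = quantum_lhs > quantum_rhs
print(local_ok, violated)
```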
C. Mermin's Argument

Mermin's argument uses three settings for each of the two parties, thus giving the set of six events $\{X_1, Y_1, X_2, Y_2, X_3, Y_3\}$.
First, observe that the three equalities in $(X_1 = Y_1) \& (X_2 = Y_2) \& (X_3 = Y_3)$ imply at least one of the three statements in
$((X_1 = Y_2) \& (X_2 = Y_1)) \vee ((X_1 = Y_3) \& (X_3 = Y_1)) \vee ((X_2 = Y_3) \& (X_3 = Y_2))$. By the standard arguments that we used
before, we see that

$$1 - \Pr(X_1 \neq Y_1) - \Pr(X_2 \neq Y_2) - \Pr(X_3 \neq Y_3) \le \Pr((X_1 = Y_1) \& (X_2 = Y_2) \& (X_3 = Y_3)),$$

and that

$$\Pr\big( ((X_1 = Y_2) \& (X_2 = Y_1)) \vee ((X_1 = Y_3) \& (X_3 = Y_1)) \vee ((X_2 = Y_3) \& (X_3 = Y_2)) \big)$$
$$\le \Pr((X_1 = Y_2) \& (X_2 = Y_1)) + \Pr((X_1 = Y_3) \& (X_3 = Y_1)) + \Pr((X_2 = Y_3) \& (X_3 = Y_2))$$
$$\le \tfrac{1}{2} \big[ \Pr(X_1 = Y_2) + \Pr(X_2 = Y_1) + \Pr(X_1 = Y_3) + \Pr(X_3 = Y_1) + \Pr(X_2 = Y_3) + \Pr(X_3 = Y_2) \big].$$

As a result we have the 'Mermin inequality'

$$\sum_{i=1}^{3} \Pr(X_i \neq Y_i) + \frac{1}{2} \sum_{i,j=1;\, i \neq j}^{3} \Pr(X_i = Y_j) \ge 1,$$

which gets violated by a state and measurement setting that has probabilities $\Pr(X_i \neq Y_i) = 0$ and $\Pr(X_i = Y_j) = \frac{1}{4}$
for $i \neq j$ (see Section III-E in the appendix).
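
A short brute-force check of this argument is sketched below, assuming the form of the Mermin inequality with a factor 1/2 on the off-diagonal sum (the form under which the quoted quantum probabilities give a violation). The function name `mermin_lhs` is hypothetical. The code evaluates the left-hand side on all 64 deterministic assignments of binary outcomes and on the quantum values $\Pr(X_i \neq Y_i) = 0$, $\Pr(X_i = Y_j) = 1/4$.

```python
from itertools import product

def mermin_lhs(x, y):
    """Left-hand side of the Mermin inequality for deterministic outcomes x, y."""
    diagonal = sum(x[i] != y[i] for i in range(3))
    off_diagonal = sum(x[i] == y[j] for i in range(3) for j in range(3) if i != j)
    return diagonal + 0.5 * off_diagonal

# Minimum over all deterministic local assignments of the six binary variables.
min_local = min(mermin_lhs(x, y)
                for x in product([0, 1], repeat=3)
                for y in product([0, 1], repeat=3))

# Quantum value: three diagonal terms equal to 0, six off-diagonal terms equal to 1/4.
quantum_value = 3 * 0.0 + 0.5 * 6 * 0.25
print(min_local, quantum_value)
```

Since the bound holds for every deterministic assignment, it also holds for every mixture of them, while the quantum value falls below it.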



D. GHZ's Argument 

Starting with GHZ [15], proofs against local realism have been based on systems of three or more qubits, on
higher-dimensional quantum systems, and on larger sets of measurements (settings) per particle. Each time we are allowed to
search over a wider space we may be able to obtain stronger nonlocality proofs, though each time the actual experiment may
become harder to set up in the laboratory.

Let $\oplus$ denote the exclusive or operation such that $X \oplus Y$ is true if and only if $X \neq Y$. Then the following implication
must hold:

$$\big( (X_1 \oplus Y_2 = Z_2) \& (X_2 \oplus Y_1 = Z_2) \& (X_2 \oplus Y_2 = Z_1) \big) \Rightarrow (X_1 \oplus Y_1 = Z_1). \qquad (33)$$

Now, by considering the contrapositive, we get

$$\Pr(X_1 \oplus Y_1 \neq Z_1) \le \Pr\big( (X_1 \oplus Y_2 \neq Z_2) \vee (X_2 \oplus Y_1 \neq Z_2) \vee (X_2 \oplus Y_2 \neq Z_1) \big). \qquad (34)$$

And because $\Pr(X \oplus Y \neq Z) = \Pr(X \oplus Y \oplus Z)$, this gives us GHZ's inequality:

$$\Pr(X_1 \oplus Y_1 \oplus Z_1) \le \Pr(X_1 \oplus Y_2 \oplus Z_2) + \Pr(X_2 \oplus Y_1 \oplus Z_2) + \Pr(X_2 \oplus Y_2 \oplus Z_1). \qquad (35)$$

This inequality can be violated by a three-way entangled state and measurement settings that give $\Pr(X_1 \oplus Y_1 \oplus Z_1) = 1$ and
$\Pr(X_1 \oplus Y_2 \oplus Z_2) = \Pr(X_2 \oplus Y_1 \oplus Z_2) = \Pr(X_2 \oplus Y_2 \oplus Z_1) = 0$. The details of this proof are in Section III-F.
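
The implication (33) is a statement about six Boolean variables and can be confirmed exhaustively. The following sketch does so; the helper names `premise` and `conclusion` are chosen here for illustration.

```python
from itertools import product

def premise(x1, x2, y1, y2, z1, z2):
    """Left-hand side of implication (33)."""
    return ((x1 ^ y2) == z2) and ((x2 ^ y1) == z2) and ((x2 ^ y2) == z1)

def conclusion(x1, x2, y1, y2, z1, z2):
    """Right-hand side of implication (33)."""
    return (x1 ^ y1) == z1

assignments = list(product([False, True], repeat=6))
satisfying = [v for v in assignments if premise(*v)]

# The implication holds if the conclusion is true whenever the premise is.
implication_holds = all(conclusion(*v) for v in satisfying)
print(len(assignments), len(satisfying), implication_holds)
```

The reason is visible in the algebra: taking the exclusive or of the three premises cancels $X_2$, $Y_2$ and $Z_2$ pairwise and leaves $X_1 \oplus Y_1 = Z_1$.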



Appendix III 

The Nonlocality Proofs, Their Optimal Setting Distributions and Best Classical Theories 

In this appendix we list the nonlocality proofs of Bell, an optimized version of Bell, CHSH, Hardy, Mermin and GHZ and
their solutions. The proofs themselves are described by a multipartite quantum state and the measurement bases $|m\rangle$ of the
parties. Because all bases are two dimensional in the proofs below, it is sufficient to only describe the vector $|m\rangle$, where it
is understood that the other basis vector ($|\perp m\rangle$) is the orthogonal one. Because of its frequent use, we define for the whole
appendix the rotated vector $|R(\phi)\rangle := \cos(\phi)|0\rangle + \sin(\phi)|1\rangle$. A measurement setting refers to the bases that parties
use during a trial of the experiment. All proofs, except Mermin's, have two different settings per party (in Mermin's they have three).

Given the state and the measurement bases, the proof is summarized in a table of probabilities of the possible measurement
outcomes. Here we list these probabilities conditionally on the specific measurement settings. For example, for Bell's original
nonlocality proof, which uses the state $|\Psi\rangle := \frac{1}{\sqrt{2}}(|0_A 0_B\rangle + |1_A 1_B\rangle)$ and the measurement vectors
$|X = T\rangle_{a=1} := |R(0)\rangle$ and $|Y = T\rangle_{b=1} := |R(\frac{\pi}{8})\rangle$, we list the probability
$Q_{11}(X = T, Y = T) = |\langle \Psi | X = T, Y = T \rangle_{a=1,b=1}|^2 \approx 0.4268$ in the table.
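
The worked value 0.4268 can be reproduced numerically. The sketch below computes the four outcome probabilities for the state $\frac{1}{\sqrt{2}}(|0_A 0_B\rangle + |1_A 1_B\rangle)$ and real measurement vectors of the form $|R(\phi)\rangle$. The helper names (`R`, `orth`, `q_table`) are illustrative, and the angle $\pi/8$ for party B is an assumption inferred from the quoted probability, since the angle is not legible in every copy of the text.

```python
from math import cos, sin, pi, sqrt

def R(phi):
    """Rotated vector cos(phi)|0> + sin(phi)|1> as a pair of real amplitudes."""
    return (cos(phi), sin(phi))

def orth(v):
    """Unit vector orthogonal to the real vector v (the 'F' outcome)."""
    return (-v[1], v[0])

def q_table(x_vec, y_vec):
    """Outcome probabilities for (|00> + |11>)/sqrt(2), keyed by (x, y) labels."""
    probs = {}
    for x_label, xv in (("T", x_vec), ("F", orth(x_vec))):
        for y_label, yv in (("T", y_vec), ("F", orth(y_vec))):
            amplitude = (xv[0] * yv[0] + xv[1] * yv[1]) / sqrt(2)
            probs[(x_label, y_label)] = amplitude ** 2
    return probs

q11 = q_table(R(0.0), R(pi / 8))   # assumed angles: 0 for A, pi/8 for B
print(round(q11[("T", "T")], 4), round(q11[("T", "F")], 4))
```

For such real vectors the probabilities reduce to $\frac{1}{2}\cos^2(\theta_a - \theta_b)$ for equal outcomes and $\frac{1}{2}\sin^2(\theta_a - \theta_b)$ for unequal ones.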

As discussed in the article (Section IV-B), the strength of a nonlocality proof will depend on the probabilities $\sigma$ with which
the parties use the different measurement settings. Recall that we defined three different notions of strength, depending on how
these probabilities are determined: uniform settings ($S_Q^{UNI}$), uncorrelated settings ($S_Q^{UC}$) and correlated settings
($S_Q^{COR}$). For both the correlated and the uncorrelated settings, the parties can optimize their setting distributions to get the
strongest possible statistics to prove the nonlocality of their measurement outcomes. We list these optimal distributions below
where, for example, $\Pr(a = 1) = \sigma_{11} + \sigma_{12}$ stands for the probability that party A uses the measurement basis
$\{|(X = T | a = 1)\rangle, |(X = F | a = 1)\rangle\}$, and $\Pr(a = 1, b = 2) = \sigma_{12}$ is the probability that A uses the basis
$\{|(X = T | a = 1)\rangle, |(X = F | a = 1)\rangle\}$ while B uses the basis $\{|(Y = T | b = 2)\rangle, |(Y = F | b = 2)\rangle\}$, etc.

Associated with these optimal distributions there is an optimal local realist theory $\pi \in \Pi$ (see Section IV-B). The probabilities
for such optimal classical theories are listed below as well and should be compared with the tables of the nonlocality proofs.
Combining these data tables for each proof and each scenario we obtain the strengths that were listed in Section V.

A. Original Bell 

For Bell's proof of nonlocality we have to make a distinction between the original version, which Bell described in [5], and
the optimized version, which is described by Peres in [27].

First we discuss Bell's original proof. Take the bipartite state $\frac{1}{\sqrt{2}}|0_A 0_B\rangle + \frac{1}{\sqrt{2}}|1_A 1_B\rangle$, and the
measurement settings

$|X = T\rangle_{a=1} := |R(0)\rangle$ and $|X = T\rangle_{a=2} := |R(\frac{\pi}{8})\rangle$,
$|Y = T\rangle_{b=1} := |R(\frac{\pi}{8})\rangle$ and $|Y = T\rangle_{b=2} := |R(\frac{\pi}{4})\rangle$.

With these settings, quantum mechanics predicts the conditional probabilities of Table II (where $\frac{1}{4} + \frac{1}{8}\sqrt{2} \approx 0.4267766953$
and $\frac{1}{4} - \frac{1}{8}\sqrt{2} \approx 0.0732233047$).

(1) Uniform Settings, Original Bell: When the two parties use uniform distributions for their settings, the optimal classical
theory is the one described in Table III. The corresponding KL distance is 0.0141597409.

(2) Uncorrelated Settings, Original Bell: The optimized, uncorrelated setting distribution is described in Table IV. The
probabilities of the best classical theory for this uncorrelated setting distribution are those in Table V. The KL distance for
Bell's original proof with uncorrelated measurement settings is 0.0158003672.

(3) Correlated Settings, Original Bell: The optimized, correlated setting distribution is described in Table VI. The probabilities
of the best classical theory for this distribution are described in Table VII. The corresponding KL distance is 0.0169800305.

B. Optimized Bell 

Take the bipartite state $\frac{1}{\sqrt{2}}|0_A 0_B\rangle + \frac{1}{\sqrt{2}}|1_A 1_B\rangle$, and the measurement settings

$|X = T\rangle_{a=1} := |R(0)\rangle$ and $|X = T\rangle_{a=2} := |R(\frac{\pi}{6})\rangle$,
$|Y = T\rangle_{b=1} := |R(0)\rangle$ and $|Y = T\rangle_{b=2} := |R(\frac{\pi}{3})\rangle$.

With these settings, quantum mechanics predicts the conditional probabilities of Table VIII.

(1) Uniform Settings, Optimized Bell: For the uniform setting distribution the best classical approximation is given in
Table IX, which gives a KL distance of 0.0177632822.

(2) Uncorrelated Settings, Optimized Bell: The optimal, uncorrelated setting distribution is given in Table X. The probabilities
of the best classical theory for this distribution are those of Table XI. The corresponding KL distance is 0.0191506613.

(3) Correlated Settings, Optimized Bell: The optimal correlated setting distribution is given in Table XII. The probabilities
of the best classical theory for this distribution are given in Table XIII. The corresponding KL distance is 0.0211293952.

C. CHSH 

The bipartite state $\frac{1}{\sqrt{2}}|0_A 0_B\rangle + \frac{1}{\sqrt{2}}|1_A 1_B\rangle$. A's and B's measurement settings are:

$|X = T\rangle_{a=1} := |R(0)\rangle$ and $|X = T\rangle_{a=2} := |R(\frac{\pi}{4})\rangle$, (36)
$|Y = T\rangle_{b=1} := |R(\frac{\pi}{8})\rangle$ and $|Y = T\rangle_{b=2} := |R(-\frac{\pi}{8})\rangle$. (37)

With these settings, quantum mechanics predicts the conditional probabilities of Table XIV (with $\frac{1}{4} + \frac{1}{8}\sqrt{2} \approx 0.4267766953$
and $\frac{1}{4} - \frac{1}{8}\sqrt{2} \approx 0.0732233047$).

Uniform, Uncorrelated and Correlated Settings, CHSH: The optimal setting distribution is the uniform one, where
both A and B use each of their two measurements with probability 0.5 (that is, $\sigma_{ab} = 0.25$).

The optimal classical theory in this scenario has the probabilities of Table XV.
TABLE II

Quantum Predictions Original Bell

Q_ab(X = x, Y = y)            a = 1                           a = 2
                        x = T          x = F            x = T          x = F
b = 1     y = T     0.4267766953   0.0732233047     0.5            0
          y = F     0.0732233047   0.4267766953     0              0.5
b = 2     y = T     0.25           0.25             0.4267766953   0.0732233047
          y = F     0.25           0.25             0.0732233047   0.4267766953



TABLE III

Best Classical Theory for Uniform Settings Original Bell. KL Distance: 0.0141597409.

P_ab(X = x, Y = y)            a = 1                           a = 2
                        x = T          x = F            x = T          x = F
b = 1     y = T     0.3970311357   0.1029688643     0.5000000000   0.0000000000
          y = F     0.1029688643   0.3970311357     0.0000000000   0.5000000000
b = 2     y = T     0.2940622714   0.2059377286     0.3970311357   0.1029688643
          y = F     0.2059377286   0.2940622714     0.1029688643   0.3970311357



TABLE IV

Optimized Uncorrelated Setting Distribution Original Bell

Pr(A = a, B = b) = $\sigma_{ab} \in \Sigma^{UC}$
               a = 1          a = 2          Pr(B = b)
b = 1          0.2316110419   0.1327830656   0.3643941076
b = 2          0.4039948505   0.2316110419   0.6356058924
Pr(A = a)      0.6356058924   0.3643941076





TABLE V

Best Classical Theory for Uncorrelated Settings Original Bell. KL Distance: 0.0158003672.

P_ab(X = x, Y = y)            a = 1                           a = 2
                        x = T          x = F            x = T          x = F
b = 1     y = T     0.3901023259   0.1098976741     0.5000000000   0.0000000000
          y = F     0.1098976741   0.3901023259     0.0000000000   0.5000000000
b = 2     y = T     0.2802046519   0.2197953481     0.3901023259   0.1098976741
          y = F     0.2197953481   0.2802046519     0.1098976741   0.3901023259



TABLE VI

Optimized Correlated Setting Distribution Original Bell

Pr(A = a, B = b) = $\sigma_{ab} \in \Sigma$
               a = 1          a = 2
b = 1          0.2836084841   0.1020773549
b = 2          0.3307056768   0.2836084841



TABLE VII

Best Classical Theory for Correlated Settings Original Bell. KL Distance: 0.0169800305.

P_ab(X = x, Y = y)            a = 1                           a = 2
                        x = T          x = F            x = T          x = F
b = 1     y = T     0.3969913979   0.1030086021     0.4941498806   0.0058501194
          y = F     0.1030086021   0.3969913979     0.0058501194   0.4941498806
b = 2     y = T     0.2881326764   0.2118673236     0.3969913979   0.1030086021
          y = F     0.2118673236   0.2881326764     0.1030086021   0.3969913979
TABLE VIII

Quantum Predictions Optimized Bell

Q_ab(X = x, Y = y)            a = 1                           a = 2
                        x = T          x = F            x = T          x = F
b = 1     y = T     0.5            0                0.375          0.125
          y = F     0              0.5              0.125          0.375
b = 2     y = T     0.125          0.375            0.375          0.125
          y = F     0.375          0.125            0.125          0.375



TABLE IX

Best Classical Theory for Uniform Settings Optimized Bell. KL Distance: 0.0177632822.

P_ab(X = x, Y = y)            a = 1                           a = 2
                        x = T          x = F            x = T          x = F
b = 1     y = T     0.5000000000   0.0000000000     0.3333333333   0.1666666667
          y = F     0.0000000000   0.5000000000     0.1666666667   0.3333333333
b = 2     y = T     0.1666666667   0.3333333333     0.3333333333   0.1666666667
          y = F     0.3333333333   0.1666666667     0.1666666667   0.3333333333



TABLE X

Optimized Uncorrelated Setting Distribution Optimized Bell

Pr(A = a, B = b) = $\sigma_{ab} \in \Sigma^{UC}$
               a = 1          a = 2          Pr(B = b)
b = 1          0.1497077788   0.2372131160   0.3869208948
b = 2          0.2372131160   0.3758659893   0.6130791052
Pr(A = a)      0.3869208948   0.6130791052





TABLE XI

Best Classical Theory for Uncorrelated Settings Optimized Bell. KL Distance: 0.0191506613.

P_ab(X = x, Y = y)            a = 1                           a = 2
                        x = T          x = F            x = T          x = F
b = 1     y = T     0.5000000000   0.0000000000     0.3267978563   0.1732021436
          y = F     0.0000000000   0.5000000000     0.1732021436   0.3267978563
b = 2     y = T     0.1732021436   0.3267978563     0.3464042873   0.1535957127
          y = F     0.3267978563   0.1732021436     0.1535957127   0.3464042873



TABLE XII

Optimized Correlated Setting Distribution Optimized Bell

Pr(A = a, B = b) = $\sigma_{ab} \in \Sigma$
               a = 1          a = 2
b = 1          0.1046493146   0.2984502285
b = 2          0.2984502285   0.2984502285



TABLE XIII

Best Classical Theory for Correlated Settings Optimized Bell. KL Distance: 0.0211293952.

P_ab(X = x, Y = y)            a = 1                           a = 2
                        x = T          x = F            x = T          x = F
b = 1     y = T     0.4927305107   0.0072694892     0.3357564964   0.1642435036
          y = F     0.0072694892   0.4927305107     0.1642435036   0.3357564964
b = 2     y = T     0.1642435036   0.3357564964     0.3357564964   0.1642435036
          y = F     0.3357564964   0.1642435036     0.1642435036   0.3357564964
TABLE XIV

Quantum Predictions CHSH

Q_ab(X = x, Y = y)            a = 1                           a = 2
                        x = T          x = F            x = T          x = F
b = 1     y = T     0.4267766953   0.0732233047     0.4267766953   0.0732233047
          y = F     0.0732233047   0.4267766953     0.0732233047   0.4267766953
b = 2     y = T     0.4267766953   0.0732233047     0.0732233047   0.4267766953
          y = F     0.0732233047   0.4267766953     0.4267766953   0.0732233047


TABLE XV

Best Classical Theory for Uniform Settings CHSH. KL Distance: 0.0462738469.

P_ab(X = x, Y = y)            a = 1                           a = 2
                        x = T          x = F            x = T          x = F
b = 1     y = T     0.375          0.125            0.375          0.125
          y = F     0.125          0.375            0.125          0.375
b = 2     y = T     0.375          0.125            0.125          0.375
          y = F     0.125          0.375            0.375          0.125



D. Hardy 

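The bipartite state $\alpha|0_A 0_B\rangle - \beta|1_A 1_B\rangle$, with $\alpha := \frac{1}{2}\sqrt{2 + 2\sqrt{-13 + 6\sqrt{5}}} \approx 0.907$ and
$\beta := \sqrt{1 - \alpha^2} \approx 0.421$ (such that indeed $\alpha^2 + \beta^2 = 1$). A's and B's measurement settings are now identical
and given by:

$$|X = T\rangle_{a=1} = |Y = T\rangle_{b=1} := \sqrt{\frac{\beta}{\alpha+\beta}}\,|0\rangle + \sqrt{\frac{\alpha}{\alpha+\beta}}\,|1\rangle, \qquad (38)$$

$$|X = T\rangle_{a=2} = |Y = T\rangle_{b=2} := -\sqrt{\frac{\beta^3}{\alpha^3+\beta^3}}\,|0\rangle + \sqrt{\frac{\alpha^3}{\alpha^3+\beta^3}}\,|1\rangle. \qquad (39)$$

With these settings, quantum mechanics predicts the conditional probabilities of Table XVI.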
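
The entries of the Hardy table can be cross-checked numerically. The sketch below is an illustration under stated assumptions: it takes the state $\alpha|0_A 0_B\rangle - \beta|1_A 1_B\rangle$ with $\alpha = \frac{1}{2}\sqrt{2 + 2\sqrt{-13 + 6\sqrt{5}}}$ and the measurement vectors $\sqrt{\beta/(\alpha+\beta)}|0\rangle + \sqrt{\alpha/(\alpha+\beta)}|1\rangle$ and $-\sqrt{\beta^3/(\alpha^3+\beta^3)}|0\rangle + \sqrt{\alpha^3/(\alpha^3+\beta^3)}|1\rangle$, a reading of equations (38) and (39) that is consistent with the tabulated probabilities. The names `m1`, `m2`, `orth` and `prob` are chosen for this example.

```python
from math import sqrt

alpha = 0.5 * sqrt(2 + 2 * sqrt(-13 + 6 * sqrt(5)))
beta = sqrt(1 - alpha ** 2)

# Assumed measurement vectors ('T' outcomes) for settings 1 and 2, as (amp0, amp1).
m1 = (sqrt(beta / (alpha + beta)), sqrt(alpha / (alpha + beta)))
d = alpha ** 3 + beta ** 3
m2 = (-sqrt(beta ** 3 / d), sqrt(alpha ** 3 / d))

def orth(v):
    """Unit vector orthogonal to the real vector v (the 'F' outcome)."""
    return (-v[1], v[0])

def prob(xv, yv):
    """Probability of the outcome pair (xv, yv) in the state alpha|00> - beta|11>."""
    amplitude = alpha * xv[0] * yv[0] - beta * xv[1] * yv[1]
    return amplitude ** 2

p_x2_and_y2 = prob(m2, m2)              # left-hand side of Hardy's inequality
p_x2_not_y1 = prob(m2, orth(m1))        # the three right-hand terms
p_not_x1_y2 = prob(orth(m1), m2)
p_x1_and_y1 = prob(m1, m1)
print(round(alpha, 3), round(beta, 3), round(p_x2_and_y2, 4))
```

Under these assumptions the three right-hand probabilities vanish up to rounding error while the left-hand one is about 0.09, as stated in Hardy's argument.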

(1) Uniform Settings, Hardy: For uniform measurement settings, the best classical theory to describe the quantum mechanical
statistics is given in Table XVII, with KL divergence 0.0278585182.

(2) Uncorrelated Settings, Hardy: The optimized uncorrelated setting distribution is given in Table XVIII. The probabilities
of the best classical theory for this distribution are in Table XIX. The corresponding KL distance is 0.0279816333.

(3) Correlated Settings, Hardy: The optimized correlated setting distribution is given in Table XX. The probabilities of the
best classical theory for this distribution are described in Table XXI. The corresponding KL distance is 0.0280347655.

E. Mermin 

In [25], we find the following nonlocality proof with two parties, three measurement settings, and two possible outcomes.
Let the entangled state be $\frac{1}{\sqrt{2}}(|0_A 0_B\rangle + |1_A 1_B\rangle)$, and the measurement settings:

$|X = T\rangle_{a=1} = |Y = T\rangle_{b=1} = |0\rangle$,
$|X = T\rangle_{a=2} = |Y = T\rangle_{b=2} = |R(\frac{2}{3}\pi)\rangle$,
$|X = T\rangle_{a=3} = |Y = T\rangle_{b=3} = |R(\frac{4}{3}\pi)\rangle$.

With these settings, quantum mechanics predicts the conditional probabilities of Table XXII.

(1) Uniform Settings, Mermin: The probabilities of the best classical theory for the uniform measurement settings are given
in Table XXIII.

(2) Uncorrelated Settings, Mermin: The optimal uncorrelated setting distribution is given in Table XXIV. The probabilities
of the best classical theory for this distribution are in Table XXV.

(3) Correlated Settings, Mermin: The optimal correlated setting distribution is given in Table XXVI (note that there are
also other optimal distributions). The probabilities of the best classical theory for this specific distribution are described in
Table XXVII. The corresponding KL distance is 0.0211293952.

F. GHZ 

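The tripartite state $\frac{1}{\sqrt{2}}|0_A 0_B 0_C\rangle + \frac{1}{\sqrt{2}}|1_A 1_B 1_C\rangle$. The settings for all three parties are identical:

$$|X = T\rangle_{a=1} = |Y = T\rangle_{b=1} = |Z = T\rangle_{c=1} := \tfrac{1}{\sqrt{2}}|0\rangle + \tfrac{1}{\sqrt{2}}|1\rangle, \qquad (40)$$
$$|X = T\rangle_{a=2} = |Y = T\rangle_{b=2} = |Z = T\rangle_{c=2} := \tfrac{1}{\sqrt{2}}|0\rangle + \tfrac{i}{\sqrt{2}}|1\rangle. \qquad (41)$$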
TABLE XVI

Quantum Predictions Hardy

Q_ab(X = x, Y = y)             a = 1                             a = 2
                        x = T           x = F             x = T           x = F
b = 1     y = T     0               0.38196601125     0.23606797750   0.14589803375
          y = F     0.38196601125   0.23606797750     0               0.61803398875
b = 2     y = T     0.23606797750   0                 0.09016994375   0.14589803375
          y = F     0.14589803375   0.61803398875     0.14589803375   0.61803398875



TABLE XVII

Best Classical Theory for Uniform Settings Hardy. KL Distance: 0.0278585182.

P_ab(X = x, Y = y)            a = 1                           a = 2
                        x = T          x = F            x = T          x = F
b = 1     y = T     0.0338829434   0.3543640363     0.2190090188   0.1692379609
          y = F     0.3543640363   0.2573889840     0.0075052045   0.6042478158
b = 2     y = T     0.2190090188   0.0075052045     0.0488933524   0.1776208709
          y = F     0.1692379609   0.6042478158     0.1776208709   0.5958649058



TABLE XVIII

Optimized Uncorrelated Setting Distribution Hardy

Pr(A = a, B = b) = $\sigma_{ab} \in \Sigma^{UC}$
               a = 1          a = 2          Pr(B = b)
b = 1          0.2603092699   0.2498958554   0.5102051253
b = 2          0.2498958554   0.2398990193   0.4897948747
Pr(A = a)      0.5102051253   0.4897948747





TABLE XIX

Best Classical Theory for Uncorrelated Settings Hardy. KL Distance: 0.0279816333.

P_ab(X = x, Y = y)            a = 1                           a = 2
                        x = T          x = F            x = T          x = F
b = 1     y = T     0.0198831449   0.3612213769     0.2143180373   0.1667864844
          y = F     0.3612213769   0.2576741013     0.0141212511   0.6047742271
b = 2     y = T     0.2143180373   0.0141212511     0.0481256471   0.1803136414
          y = F     0.1667864844   0.6047742271     0.1803136414   0.5912470702



TABLE XX

Optimized Correlated Setting Distribution Hardy

Pr(A = a, B = b) = $\sigma_{ab} \in \Sigma$
               a = 1          a = 2
b = 1          0.2562288294   0.2431695652
b = 2          0.2431695652   0.2574320402



TABLE XXI

Best Classical Theory for Correlated Settings Hardy. KL Distance: 0.0280347655.

P_ab(X = x, Y = y)            a = 1                           a = 2
                        x = T          x = F            x = T          x = F
b = 1     y = T     0.0173443545   0.3620376608     0.2123471649   0.1670348504
          y = F     0.3620376608   0.2585803238     0.0165954828   0.6040225019
b = 2     y = T     0.2123471649   0.0165954828     0.0505353201   0.1784073276
          y = F     0.1670348504   0.6040225019     0.1784073276   0.5926500247



TABLE XXII

Quantum Predictions Mermin

Q_ab(X = x, Y = y)         a = 1             a = 2             a = 3
                       x = T   x = F     x = T   x = F     x = T   x = F
b = 1     y = T        0.5     0         0.125   0.375     0.125   0.375
          y = F        0       0.5       0.375   0.125     0.375   0.125
b = 2     y = T        0.125   0.375     0.5     0         0.125   0.375
          y = F        0.375   0.125     0       0.5       0.375   0.125
b = 3     y = T        0.125   0.375     0.125   0.375     0.5     0
          y = F        0.375   0.125     0.375   0.125     0       0.5



TABLE XXIII

Best Classical Theory for Uniform Settings Mermin. KL Distance: 0.0157895843.

P_ab(X = x, Y = y)          a = 1                a = 2                a = 3
                       x = T     x = F      x = T     x = F      x = T     x = F
b = 1     y = T        0.50000   0.00000    0.16667   0.33333    0.16667   0.33333
          y = F        0.00000   0.50000    0.33333   0.16667    0.33333   0.16667
b = 2     y = T        0.16667   0.33333    0.50000   0.00000    0.16667   0.33333
          y = F        0.33333   0.16667    0.00000   0.50000    0.33333   0.16667
b = 3     y = T        0.16667   0.33333    0.16667   0.33333    0.50000   0.00000
          y = F        0.33333   0.16667    0.33333   0.16667    0.00000   0.50000



TABLE XXIV

Optimized Uncorrelated Setting Distribution Mermin

Pr(A = a, B = b) = $\sigma_{ab} \in \Sigma^{UC}$
               a = 1          a = 2   a = 3          Pr(B = b)
b = 1          0.1497077711   0       0.2372131137   0.3869208848
b = 2          0.2372131137   0       0.3758660015   0.6130791152
b = 3          0              0       0              0
Pr(A = a)      0.3869208848   0       0.6130791152





TABLE XXV

Best Classical Theory for Uncorrelated Settings Mermin. KL Distance: 0.0191506613.

P_ab(X = x, Y = y)          a = 1                a = 2                a = 3
                       x = T     x = F      x = T     x = F      x = T     x = F
b = 1     y = T        0.50000   0.00000    0.50000   0.00000    0.17320   0.32680
          y = F        0.00000   0.50000    0.50000   0.00000    0.32680   0.17320
b = 2     y = T        0.17320   0.32680    0.50000   0.00000    0.15360   0.34640
          y = F        0.32680   0.17320    0.50000   0.00000    0.34640   0.15360
b = 3     y = T        0.50000   0.50000    1.00000   0.00000    0.50000   0.50000
          y = F        0.00000   0.00000    0.00000   0.00000    0.00000   0.00000



TABLE XXVI

Optimized Correlated Setting Distribution Mermin

Pr(A = a, B = b) = $\sigma_{ab} \in \Sigma$
               a = 1          a = 2   a = 3
b = 1          0.1046493071   0       0.2984502310
b = 2          0.2984502310   0       0.2984502310
b = 3          0              0       0












TABLE XXVII

Best Classical Theory for Correlated Settings Mermin. KL Distance: 0.0211293952.

P_ab(X = x, Y = y)          a = 1                a = 2                a = 3
                       x = T     x = F      x = T     x = F      x = T     x = F
b = 1     y = T        0.49273   0.00727    0.50000   0.00000    0.16424   0.33576
          y = F        0.00727   0.49273    0.50000   0.00000    0.33576   0.16424
b = 2     y = T        0.16424   0.33576    0.50000   0.00000    0.16424   0.33576
          y = F        0.33576   0.16424    0.50000   0.00000    0.33576   0.16424
b = 3     y = T        0.50000   0.50000    1.00000   0.00000    0.50000   0.50000
          y = F        0.00000   0.00000    0.00000   0.00000    0.00000   0.00000
TABLE XXVIII

Quantum Predictions GHZ

Q_abc(X = x, Y = y, Z = z)              a = 1             a = 2
                                    x = T   x = F     x = T   x = F
b = 1, c = 1     z = T   y = T      0.25    0         0.125   0.125
                         y = F      0       0.25      0.125   0.125
                 z = F   y = T      0       0.25      0.125   0.125
                         y = F      0.25    0         0.125   0.125
b = 2, c = 1     z = T   y = T      0.125   0.125     0       0.25
                         y = F      0.125   0.125     0.25    0
                 z = F   y = T      0.125   0.125     0.25    0
                         y = F      0.125   0.125     0       0.25
b = 1, c = 2     z = T   y = T      0.125   0.125     0       0.25
                         y = F      0.125   0.125     0.25    0
                 z = F   y = T      0.125   0.125     0.25    0
                         y = F      0.125   0.125     0       0.25
b = 2, c = 2     z = T   y = T      0       0.25      0.125   0.125
                         y = F      0.25    0         0.125   0.125
                 z = F   y = T      0.25    0         0.125   0.125
                         y = F      0       0.25      0.125   0.125



TABLE XXIX

Best Classical Theory for Uniform Settings GHZ. KL Distance: 0.2075187496.

P_abc(X = x, Y = y, Z = z)              a = 1               a = 2
                                    x = T    x = F      x = T    x = F
b = 1, c = 1     z = T   y = T      0.1875   0.0625     0.125    0.125
                         y = F      0.0625   0.1875     0.125    0.125
                 z = F   y = T      0.0625   0.1875     0.125    0.125
                         y = F      0.1875   0.0625     0.125    0.125
b = 2, c = 1     z = T   y = T      0.125    0.125      0.0625   0.1875
                         y = F      0.125    0.125      0.1875   0.0625
                 z = F   y = T      0.125    0.125      0.1875   0.0625
                         y = F      0.125    0.125      0.0625   0.1875
b = 1, c = 2     z = T   y = T      0.125    0.125      0.0625   0.1875
                         y = F      0.125    0.125      0.1875   0.0625
                 z = F   y = T      0.125    0.125      0.1875   0.0625
                         y = F      0.125    0.125      0.0625   0.1875
b = 2, c = 2     z = T   y = T      0.0625   0.1875     0.125    0.125
                         y = F      0.1875   0.0625     0.125    0.125
                 z = F   y = T      0.1875   0.0625     0.125    0.125
                         y = F      0.0625   0.1875     0.125    0.125



With these settings, quantum mechanics predicts the conditional probabilities of Table XXVIII.

(1) Uniform and Uncorrelated Settings, GHZ: For all three settings, the best possible classical statistics that approximate
the GHZ experiment are those of Table XXIX. The optimal uncorrelated setting distribution is the uniform one, which samples
all eight measurement settings with equal probability. The corresponding KL divergence is 0.2075187496.

(2) Correlated Settings, GHZ: The optimal correlated setting distribution samples with equal probability those four settings
that yield the (0.25, 0) outcome probabilities (those are the settings where an even number of the measurements are measuring
along the $m_1$ axis). The KL divergence in this setting is twice that of the previous uniform setting: 0.4150374993.
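
Both GHZ values can be recovered from the table entries with a few lines of arithmetic. The sketch below is illustrative: it uses the fact that, in the four settings with deterministic parity, the quantum distribution puts mass 0.25 on four outcomes where the classical theory of the uniform-settings table puts 0.1875, while in the remaining four settings both distributions are uniform (0.125 on all eight outcomes). The variable names are chosen here, and evaluating that same classical theory on the four parity settings is an assumption that happens to reproduce the quoted correlated value.

```python
from math import log2

# KL divergence (bits) within one 'parity' setting: four outcomes with Q = 1/4, P = 3/16.
kl_parity = sum(0.25 * log2(0.25 / 0.1875) for _ in range(4))

# KL divergence within one 'flat' setting: eight outcomes with Q = P = 1/8.
kl_flat = sum(0.125 * log2(0.125 / 0.125) for _ in range(8))

# Uniform settings: all eight joint settings sampled with probability 1/8.
kl_uniform = (4 * kl_parity + 4 * kl_flat) / 8

# Correlated settings: only the four parity settings, each with probability 1/4.
kl_correlated = (4 * kl_parity) / 4
print(f"{kl_uniform:.10f} {kl_correlated:.10f}")
```

In closed form the two numbers are $\frac{1}{2}\log_2\frac{4}{3}$ and $\log_2\frac{4}{3}$, which makes the factor of two explicit.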

Appendix IV 
The Kullback-Leibler Divergence 

This appendix provides in-depth information about the Kullback-Leibler divergence and its relation to statistical strength.
Appendix IV-A gives the formal connection between KL divergence and statistical strength. Appendix IV-B discusses some
general properties of KL divergence. Appendix IV-C compares it to variation distance. Appendix IV-D informally explains
why KL divergence is related to statistical strength.

A. Formal Connection between KL Divergence and Statistical Strength 

We consider three methods for statistical hypothesis testing: frequentist hypothesis testing [30], Bayesian hypothesis
testing [23] and information-theoretic hypothesis testing [24], [31]. Nearly all state-of-the-art, theoretically motivated statistical
methodology falls in either the Bayesian or the frequentist categories. Frequentist hypothesis testing is the most common, the
most taught in statistics classes and is the standard method in, for example, the medical sciences. Bayesian hypothesis testing
is becoming more and more popular in, for example, econometrics and biological applications. While theoretically important,
the information-theoretic methods are less used in practice and are discussed mainly because they lead to a very concrete
interpretation of statistical strength in terms of bits of information.

We illustrate below that in all three approaches the KL divergence indeed captures the notion of 'statistical strength'. We
consider the general situation with a sample $Z_1, Z_2, \ldots$, with the $Z_i$ independently and identically distributed according to some
$Q_\sigma$, $Q_\sigma$ being some distribution over some finite set $\mathcal{Z}$. For each $n$, the first $n$ outcomes are distributed according to the
$n$-fold product distribution of $Q_\sigma$, which we shall also refer to as $Q_\sigma$. Hence $Q_\sigma(z_1, \ldots, z_n) = \prod_{i=1}^{n} Q_\sigma(z_i)$. The
independence assumption also induces a distribution over the set $\mathcal{Z}^\infty$ of all infinite sequences$^3$, which we shall also refer to
as $Q_\sigma$.

We test $Q_\sigma$ against a set of distributions $\mathcal{P}_\sigma$. Here $Q_\sigma$ and $\mathcal{P}_\sigma$ may, but do not necessarily, refer to quantum and local
realist theories; the statements below hold more generally.

1) Frequentist Justification: In frequentist hypothesis testing, $\mathcal{P}_\sigma$ is called the null-hypothesis and $Q_\sigma$ the alternative
hypothesis. Frequentist hypothesis testing can be implemented in a number of different ways, depending on what statistical
test one adopts. A statistical test is a procedure that, when given an arbitrary sample of arbitrary length as input, outputs a decision.
The decision is either '$Q_\sigma$ generated the data' or '$\mathcal{P}_\sigma$ generated the data'. Each test is defined relative to some test statistic
$T$ and critical value $c$. A test statistic $T$ is a function defined on samples of arbitrary length, that for each sample outputs a
real number. Intuitively, large values of the test statistic indicate that something has happened which is much more unlikely
under any of the distributions in the null-hypothesis than under the alternative hypothesis. A function that is often used as a
test statistic is the log-likelihood ratio

$$T(z_1, \ldots, z_n) := \log \frac{Q_\sigma(z_1, \ldots, z_n)}{\sup_{P \in \mathcal{P}_\sigma} P(z_1, \ldots, z_n)}, \qquad (42)$$

but many other choices are possible as well.

The critical value $c$ determines the threshold for the test's decision: if, for the observed data $z_1, \ldots, z_n$, it holds that
$T(z_1, \ldots, z_n) \ge c$, the test says '$Q_\sigma$ generated the data'; if $T(z_1, \ldots, z_n) < c$, the test says '$\mathcal{P}_\sigma$ generated the data'.

The confidence in a given decision is determined by a quantity known as the p-value. This is a function of the data that was
actually observed in the statistical experiment. It only depends on the observed value of the test statistic
$t_{\text{observed}} := T(z_1, \ldots, z_n)$. It is defined as

$$\text{p-value} := \sup_{P \in \mathcal{P}_\sigma} P(T(Z_1, \ldots, Z_n) \ge t_{\text{observed}}). \qquad (43)$$

Here the $Z_1, \ldots, Z_n$ are distributed according to $P$ and thus do not refer to the data that was actually observed in the experiment.
Thus, the p-value is the maximum probability, under any distribution in $\mathcal{P}_\sigma$, that the test statistic takes on a value that is at
least as extreme as its actually observed outcome. Typically, the test is defined such that the critical value $c$ depends on sample
size $n$. It is set to the value $c_0$ such that the test outputs '$Q_\sigma$' iff the p-value is smaller than some pre-defined significance
level, typically 0.05.

Large p-values mean small confidence: for example, suppose the test outputs $Q_\sigma$ whenever the p-value is smaller than 0.05.
Suppose further that data are observed with a p-value of 0.04. Then the test says '$Q_\sigma$', but since the p-value is large, this
is not that convincing to someone who considers the possibility that some $P \in \mathcal{P}_\sigma$ has generated the data: the large p-value
indicates that the test may very well have given the wrong answer. On the other hand, if data are observed with a p-value of
0.001, this gives a lot more confidence in the decision output by the test.

We call a test statistic asymptotically optimal for identifying $Q_\sigma$ if, under the assumption that $Q_\sigma$ generated the data, the
p-value goes to 0 at the fastest possible rate. Now let us assume that $Q_\sigma$ generates the data, and an optimal test is used. A
well-known result due to Bahadur [2, Theorem 1] says that, under some regularity conditions on $Q_\sigma$ and $\mathcal{P}_\sigma$, with
$Q_\sigma$-probability 1, for all large $n$,

$$\text{p-value} = 2^{-n D(Q_\sigma \| \mathcal{P}_\sigma) + o(n)}, \qquad (44)$$

where $\lim_{n \to \infty} o(n)/n = 0$. We say 'the p-value is determined, to first order in the exponent, by $D(Q_\sigma \| \mathcal{P}_\sigma)$'. Note that what
we called the 'actually observed test statistic $t_{\text{observed}}$' in (43) has become a random variable in (44), distributed according to
$Q_\sigma$. It turns out that the regularity conditions, needed for Equation (44) to hold, apply when $Q_\sigma$ is instantiated to a quantum
theory $Q$ with measurement setting distributions $\sigma$, and $\mathcal{P}_\sigma$ is instantiated to the corresponding set of LR theories as defined
in Section II.

Now imagine that QM, who knows that $Q_\sigma$ generates the data, wonders whether to use the experimental setup corresponding
to $\sigma_1$ or $\sigma_2$. Suppose that $D(Q_{\sigma_1} \| \mathcal{P}_{\sigma_1}) > D(Q_{\sigma_2} \| \mathcal{P}_{\sigma_2})$. It follows from Equation (44) that if the experiment
corresponding to $\sigma_1$ is performed, the p-value will go to 0 exponentially faster (in the number of trials) than if the experiment
corresponding to $\sigma_2$ is performed. It therefore makes sense to say that 'the statistical strength of the experiment corresponding to
$\sigma_1$ is larger than the strength of $\sigma_2$'. This provides a frequentist justification of adopting $D(Q_\sigma \| \mathcal{P}_\sigma)$ as an indicator of
statistical strength.
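
To give a feel for what Equation (44) means in practice, the following sketch estimates how many trials are needed before the p-value drops below a target, using only the leading term of the exponent. It is a rough illustration, not a statement from the paper: the $o(n)$ term is ignored, the function name `trials_for_pvalue` is hypothetical, and the two divergences are the CHSH and correlated-settings GHZ values listed in Appendix III.

```python
from math import ceil, log2

def trials_for_pvalue(divergence_bits, p_target):
    """Smallest n with 2**(-n * divergence_bits) <= p_target (o(n) term ignored)."""
    return ceil(log2(1.0 / p_target) / divergence_bits)

# First-order estimates for a target p-value of 0.001.
n_chsh = trials_for_pvalue(0.0462738469, 0.001)
n_ghz = trials_for_pvalue(0.4150374993, 0.001)
print(n_chsh, n_ghz)
```

The ratio of the two estimates is close to the ratio of the divergences, which is the sense in which a larger KL divergence corresponds to a statistically stronger experiment.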

3 Readers familiar with measure theory should note that throughout this paper, we tacitly assume that $\mathcal{Z}^\infty$ is endowed with a suitable $\sigma$-algebra such that
all sets mentioned in this paper become measurable.
Remark: Bahadur [2, Theorem 2] also provides a variation of Equation (44), which (roughly speaking) says the following:
suppose $Q_\sigma$ generates the data. For $\epsilon > 0$, let $N_\epsilon$ be the minimum number of observations such that, for all $n \geq N_\epsilon$, the test
rejects $\mathcal{P}_\sigma$ (if $\mathcal{P}_\sigma$ is not rejected for infinitely many $n$, then $N_\epsilon$ is defined to be infinite). Suppose that an optimal (in the sense
we used previously) test is used. Then, for small $\epsilon$, $N_\epsilon$ is inversely proportional to $D(Q_\sigma \| \mathcal{P}_\sigma)$: with $Q_\sigma$-probability 1, the
smaller $D(Q_\sigma \| \mathcal{P}_\sigma)$, the larger $N_\epsilon$. If a 'non-optimal' test is used, then $N_\epsilon$ can only be larger, never smaller.

The rate at which the p-value of a test converges to 0 is known in statistics as Bahadur efficiency. For an overview of the
area, see [17]. For an easy introduction to the main ideas, focusing on 'Stein's lemma' (a theorem related to Bahadur's), see
[4, Chapter 12, Section 8]. For an introduction to Stein's lemma with a physicist audience in mind, see [3].

2) Bayesian Justification: In the Bayesian approach to hypothesis testing [7], [23], when testing $Q_\sigma$ against $\mathcal{P}_\sigma$, we must
first determine an a priori probability distribution over $Q_\sigma$ and $\mathcal{P}_\sigma$. This distribution over distributions is usually just called
'the prior'. It can be interpreted as indicating the prior (i.e., before seeing the data) 'degree of belief' in $Q_\sigma$ vs. $\mathcal{P}_\sigma$. It is
often used to incorporate prior knowledge into the statistical decision process. In order to set up the test as fairly as possible,
QM and LR may agree to use the prior $\Pr(Q_\sigma) = \Pr(\mathcal{P}_\sigma) = 1/2$ (this should be read as 'the prior probability that $Q_\sigma$ obtains
is equal to the prior probability that some $P \in \mathcal{P}_\sigma$ obtains'). Yet as long as $\Pr(Q_\sigma) > 0$ and there is a smooth and positive
probability density for $P_\sigma \in \mathcal{P}_\sigma$, the specific values for the priors will be irrelevant for the result below.

For given prior probabilities and a given sample $z_1, \ldots, z_n$, Bayesian statistics provides a method to compute the posterior
probabilities of the two hypotheses, conditioned on the observed data: $\Pr(Q_\sigma)$ is transformed into $\Pr(Q_\sigma \mid z_1, \ldots, z_n)$. Similarly,
$\Pr(\mathcal{P}_\sigma)$ is transformed to $\Pr(\mathcal{P}_\sigma \mid z_1, \ldots, z_n)$. One then adopts the hypothesis $H \in \{Q_\sigma, \mathcal{P}_\sigma\}$ with the larger posterior probability
$\Pr(H \mid z_1, \ldots, z_n)$. The confidence in this decision is given by the posterior odds of $Q_\sigma$ against $\mathcal{P}_\sigma$, defined, for given sample
$z_1, \ldots, z_n$, as

$$\text{post-odds}(Q_\sigma, \mathcal{P}_\sigma) := \frac{\Pr(Q_\sigma \mid z_1, \ldots, z_n)}{\Pr(\mathcal{P}_\sigma \mid z_1, \ldots, z_n)}. \tag{45}$$

The larger post-odds, the larger the confidence. Now suppose that data are distributed according to $Q_\sigma$. It can be shown
that, under some regularity conditions on $Q_\sigma$ and $\mathcal{P}_\sigma$, with $Q_\sigma$-probability 1,

$$\text{post-odds} = 2^{n D(Q_\sigma \| \mathcal{P}_\sigma) + O(\log n)}. \tag{46}$$

In our previously introduced terminology, 'the Bayesian confidence (posterior odds) is determined by $D(Q_\sigma \| \mathcal{P}_\sigma)$, up to first order
in the exponent'. We may now reason exactly as in the frequentist case to conclude that it makes sense to adopt $D(Q_\sigma \| \mathcal{P}_\sigma)$
as an indicator of statistical strength, and that it makes sense for QM to choose the setting probabilities $\sigma$ so as to maximize
$D(Q_\sigma \| \mathcal{P}_\sigma)$.

Equation (46) is a 'folklore result' which 'usually' holds. In Appendix IV-E we show that it does indeed hold with $Q_\sigma$
and $\mathcal{P}_\sigma$ defined as nonlocality proofs and local realist theories, respectively.

3) Information-Theoretic Justification: There exist several approaches to information-theoretic or compression-based hypothesis
testing; see, for example, [4], [24]. The most influential of these is the so-called Minimum Description Length Principle
[31]. The basic idea is always that the more one can compress a given sequence of data, the more regularity one has extracted 
from the data, and thus, the better one has captured the 'underlying regularities in the data'. Thus, the hypothesis that allows 
for the maximum compression of the data should be adopted. 

Let us first consider testing a simple hypothesis Q against another simple hypothesis P. Two basic facts of coding theory 
say that 

1) There exists a uniquely decodable code with lengths $L_Q$ that satisfy, for all $z_1, \ldots, z_n \in \mathcal{Z}^n$,

$$L_Q(z_1, \ldots, z_n) = \lceil -\log Q(z_1, \ldots, z_n) \rceil. \tag{47}$$

The code with lengths $L_Q$ is called the Shannon-Fano code, and its existence follows from the so-called Kraft Inequality
[10].

2) If data $Z_1, \ldots, Z_n$ are independently identically distributed $\sim Q$, then among all uniquely decodable codes, the code with
length function $L_Q$ has the shortest expected code-length. That is, let $L$ be the length function of any uniquely decodable
code over $n$ outcomes, then

$$E_Q[L(Z_1, \ldots, Z_n)] \geq E_Q[-\log Q(Z_1, \ldots, Z_n)]. \tag{48}$$
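The two coding facts above can be checked numerically on a toy example. The following is a minimal sketch, not part of the original analysis: the four-outcome distribution `Q` and the helper names are assumptions chosen for illustration. It computes the lengths of Equation (47), verifies that they satisfy the Kraft inequality, and compares the expected length with the entropy bound of Equation (48).

```python
import math

def shannon_code_lengths(q):
    """Lengths ceil(-log2 q(z)) of Equation (47), for outcomes with q(z) > 0."""
    return {z: math.ceil(-math.log2(p)) for z, p in q.items() if p > 0}

def kraft_sum(lengths):
    """Sum of 2^(-L); a uniquely decodable code with these lengths exists iff this is <= 1."""
    return sum(2.0 ** -length for length in lengths.values())

def expected_length(q, lengths):
    """Expected code length in bits when outcomes are drawn from q."""
    return sum(p * lengths[z] for z, p in q.items() if p > 0)

def entropy_bits(q):
    """Shannon entropy E_Q[-log2 Q(Z)], the lower bound in Equation (48)."""
    return -sum(p * math.log2(p) for p in q.values() if p > 0)

# Hypothetical distribution, chosen only for illustration.
Q = {"a": 0.5, "b": 0.25, "c": 0.15, "d": 0.10}
lengths = shannon_code_lengths(Q)

print("lengths:", lengths)
print("Kraft sum:", kraft_sum(lengths))
print("expected length:", expected_length(Q, lengths), "entropy:", entropy_bits(Q))
```

The rounding in Equation (47) costs less than one bit per encoded sequence, which is why the text ignores the integer requirement from here on.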

Thus, under the assumption that $Q$ generated the data, the optimal (maximally compressing) code to use will be the Shannon-Fano
code with lengths $-\log Q(Z^n)$ (here, as in the remainder of this section, we ignored the integer requirement for code
lengths). Similarly, under the assumption that some $P$ with $P \neq Q$ generated the data the optimal code will be the code with
lengths $-\log P(Z^n)$. Thus, from the information-theoretic point of view, if one wants to find out whether $P$ or $Q$ better explains
the data, one should check whether the optimal code under $P$ or the optimal code under $Q$ allows for more compression of
the data. That is, one should look at the difference

$$\text{bit-diff} := -\log P(z_1, \ldots, z_n) - [-\log Q(z_1, \ldots, z_n)]. \tag{49}$$
If bit-diff > 0, then one decides that $Q$ better explains the data. The confidence in this decision is given by the magnitude
of bit-diff: the larger bit-diff, the more extra bits one needs to encode the data under $P$ rather than $Q$, thus the larger
the confidence in $Q$.

Now suppose that $Q$ actually generates the data. The expected code length difference, measured in bits, between coding the
data using the optimal code for $Q$ and coding using the optimal code for $P$, is given by $E_Q[-\log P(Z^n) - [-\log Q(Z^n)]] =
n D(Q \| P)$. Thus, the KL divergence can be interpreted as the expected additional number of bits needed to encode outcomes
generated by $Q$, if outcomes are encoded using a code that is optimal for $P$ rather than for $Q$. Thus, the natural 'unit' of
$D(\cdot \| \cdot)$ is the 'bit', and $D(Q \| P)$ may be viewed as 'average amount of information about $Z$ that is lost if $Z$ is wrongfully
regarded as being distributed by $P$ rather than $Q$'. By the law of large numbers, Equation (49) implies that, with $Q$-probability
1, as $n \to \infty$,

$$\frac{1}{n} (\text{bit-diff}) \to D(Q \| P). \tag{50}$$
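The convergence in Equation (50) can be illustrated with a short simulation. This is a sketch under stated assumptions: the two binary distributions `Q` and `P`, the seed, and the sample size are hypothetical choices, and logarithms are taken to base 2 so that the result is in bits.

```python
import math
import random

def kl_bits(q, p):
    """D(q || p) in bits, summing only over outcomes with q(z) > 0."""
    return sum(qz * math.log2(qz / p[z]) for z, qz in q.items() if qz > 0)

def bit_diff(sample, q, p):
    """Equation (49): -log2 P(z_1..z_n) - [-log2 Q(z_1..z_n)] for an i.i.d. sample."""
    return sum(math.log2(q[z]) - math.log2(p[z]) for z in sample)

# Hypothetical distributions on {1, 2}; Q plays the role of the true distribution.
Q = {1: 0.5, 2: 0.5}
P = {1: 0.8, 2: 0.2}

rng = random.Random(0)  # fixed seed so the run is reproducible
n = 100_000
sample = rng.choices(list(Q), weights=list(Q.values()), k=n)

per_trial = bit_diff(sample, Q, P) / n
print("bit-diff / n :", per_trial)
print("D(Q || P)    :", kl_bits(Q, P))
```

For any single sample the two printed numbers differ by a fluctuation of order $1/\sqrt{n}$; only the limit is exact.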

Thus, if $Q$ generates the data, then the information-theoretic confidence bit-diff in decision "$Q$ explains the data better
than $P$" is, up to first order, determined by the KL divergence between $Q$ and $P$: the larger $D(Q \| P)$, the larger the confidence.
This gives an information-theoretic justification of the use of the KL divergence as an indicator of statistical strength for simple
hypothesis testing. We now turn to composite hypothesis testing.

Composite Hypothesis Testing: If one compares $Q_\sigma$ against a set of hypotheses $\mathcal{P}_\sigma$, then one has to associate $\mathcal{P}_\sigma$ with
a code that is 'optimal under the assumption that some $P \in \mathcal{P}_\sigma$ generated the data'. It turns out that there exist codes with
lengths $L_{\mathcal{P}_\sigma}$ satisfying, for all $z_1, \ldots, z_n \in \mathcal{Z}^n$,

$$L_{\mathcal{P}_\sigma}(z_1, \ldots, z_n) \leq \inf_{P \in \mathcal{P}_\sigma} -\log P(z_1, \ldots, z_n) + O(\log n). \tag{51}$$

An example of such a code is given in Appendix IV-F. The code $L_{\mathcal{P}_\sigma}$ is optimal, up to logarithmic terms, for whatever
distribution $P \in \mathcal{P}_\sigma$ that might actually generate data. The information theoretic approach to hypothesis testing now tells us
that, to test $Q_\sigma$ against $\mathcal{P}_\sigma$, we should compute the difference in code lengths

$$\text{bit-diff} := L_{\mathcal{P}_\sigma}(z_1, \ldots, z_n) - [-\log Q_\sigma(z_1, \ldots, z_n)]. \tag{52}$$

The larger this difference, the larger the confidence that $Q_\sigma$ rather than $\mathcal{P}_\sigma$ generated the data. In Appendix IV-F we show
that, in analogy to Equation (50), as $n \to \infty$,

$$\frac{1}{n} (\text{bit-diff}) \to D(Q_\sigma \| \mathcal{P}_\sigma). \tag{53}$$

Thus, up to sublinear terms, the information-theoretic confidence in $Q_\sigma$ is given by $n D(Q_\sigma \| \mathcal{P}_\sigma)$. This provides an
information-theoretic justification of adopting $D(Q_\sigma \| \mathcal{P}_\sigma)$ as an indicator of statistical strength.



B. Properties of Kullback-Leibler Divergence 

1) General Properties: Let $\mathcal{P}$ be the set of distributions on $\mathcal{Z}$. We equip $\mathcal{P}$ with the Euclidean topology by identifying
each $P \in \mathcal{P}$ with its probability vector. Then $D(P \| Q)$ is jointly continuous in $P$ and $Q$ on the interior of $\mathcal{P}$. It is jointly lower
semicontinuous (for a definition see, e.g., [32]), but not continuous, on $\mathcal{P} \times \mathcal{P}$. It is also jointly strictly convex on $\mathcal{P} \times \mathcal{P}$.

Because $Q(z) \log Q(z) \to 0$ as $Q(z) \downarrow 0$, we can ignore the $Q(z) = 0$ parts in the summation, and hence

$$D(Q \| P) = \sum_{z \in \mathcal{Z},\, Q(z) > 0} Q(z) \left[ -\log P(z) + \log Q(z) \right]. \tag{54}$$
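Equation (54) translates directly into code. The sketch below is an illustration only: the function name `kl_divergence` and the dictionary representation of distributions are assumptions, and logarithms are taken to base 2. It skips outcomes with $Q(z) = 0$ and returns infinity when $Q$ puts mass where $P$ does not.

```python
import math

def kl_divergence(q, p):
    """D(q || p) in bits, following Equation (54).

    Outcomes with q(z) == 0 are skipped; if q(z) > 0 while p(z) == 0,
    q is not absolutely continuous with respect to p and the result is infinite.
    """
    total = 0.0
    for z, qz in q.items():
        if qz == 0:
            continue
        pz = p.get(z, 0.0)
        if pz == 0:
            return math.inf
        total += qz * (math.log2(qz) - math.log2(pz))
    return total

# Hypothetical distributions on {1, 2}, for illustration.
point_mass = {1: 1.0, 2: 0.0}
uniform = {1: 0.5, 2: 0.5}

print(kl_divergence(point_mass, uniform))  # finite: the zero-mass outcome is ignored
print(kl_divergence(uniform, point_mass))  # infinite: uniform has mass where point_mass has none
```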

2) The Additivity Property: The KL divergence has the following additivity property. Let $\mathcal{X}$ and $\mathcal{Y}$ be finite sample spaces,
and let $P$ and $Q$ be distributions over the product space $\mathcal{X} \times \mathcal{Y}$. Let $P_X$ ($Q_X$) denote the marginal distribution of $P$ ($Q$) over
$\mathcal{X}$, and for each $x \in \mathcal{X}$, let $P_{Y|x}$ ($Q_{Y|x}$) denote the conditional distribution over $\mathcal{Y}$ conditioned on $X = x$, i.e. $P_{Y|x}(y) := P(y \mid x)$
for all $y \in \mathcal{Y}$. Then

$$D(Q \| P) = \sum_{x \in \mathcal{X}} Q(x) D(Q_{Y|x} \| P_{Y|x}) + D(Q_X \| P_X) \tag{55}$$

$$= E_{Q_X}[D(Q_{Y|X} \| P_{Y|X})] + D(Q_X \| P_X). \tag{56}$$

An important consequence of this property is that the divergence between the joint distribution of $n$ independent drawings
from $Q$ to that of $n$ independent drawings from $P$, is $n$ times the divergence for one drawing. It also implies Equation (13) in
Section IV.
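The consequence for independent drawings can be verified by brute force on a small example. In this sketch the binary distributions `Q` and `P` and the number of drawings `n` are hypothetical; the code builds the joint distributions of $n$ independent drawings explicitly and compares their divergence with $n$ times the single-drawing divergence.

```python
import math
from itertools import product

def kl_bits(q, p):
    """D(q || p) in bits, summing only over outcomes with q(z) > 0."""
    return sum(qz * math.log2(qz / p[z]) for z, qz in q.items() if qz > 0)

def product_distribution(d, n):
    """Joint distribution of n independent drawings from d, keyed by outcome tuples."""
    return {zs: math.prod(d[z] for z in zs) for zs in product(d, repeat=n)}

# Hypothetical single-drawing distributions, for illustration only.
Q = {"T": 0.7, "F": 0.3}
P = {"T": 0.4, "F": 0.6}
n = 5

joint = kl_bits(product_distribution(Q, n), product_distribution(P, n))
single = kl_bits(Q, P)
print("D(Q^n || P^n):", joint)
print("n * D(Q || P):", n * single)
```

The two numbers agree up to floating-point rounding, as the additivity property predicts.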
C. Kullback-Leibler versus Total Variation Distance

In discussions about the strengths of nonlocality proofs, it has sometimes been claimed that QM should use the filter settings
that give the largest deviation in the Bell inequality. This would mean that QM should try to set up the experiment such that
the distribution of outcomes $Q$ is as distant as possible to LR's distribution over outcomes $P$ where distance is measured by
the so-called total variation distance [12] between $Q$ and $P$, defined as $\sum_{z \in \mathcal{Z}} |P(z) - Q(z)|$. While it is true that this defines a
distance between probability distributions, it is only one of a large number of possible distances or divergences that can be defined
for probability distributions. But if one is interested in measuring 'statistical distance', total variation is not the appropriate
distance measure to use. Instead, one should use the KL divergence. To get some feel for how different KL and total variation
can be, let $\mathcal{Z} = \{1, 2\}$ and consider the following possibilities for $P$ and $Q$:

1) $P(1) = 0.99$ and $Q(1) = 1$. Then the absolute difference in probabilities between $P$ and $Q$ is very small (0.02); however,
if data are sampled from $P$, then, with high probability, after a few hundred trials we will have observed at least one
outcome 2. From that point on, we are 100% certain that $P$, and not $Q$, has generated the data. This is reflected by the fact that
$D(P \| Q) = \infty$.

2) Let $P$ and $Q$ be as above but consider $D(Q \| P)$. We have $D(Q \| P) = -1 \cdot \log 0.99 = 0.015$. This illustrates that, if $Q$
rather than $P$ generates the data, we typically need an enormous amount of data before we can be reasonably sure that
$Q$ indeed generated the data.

3) $P(1) = 0.49$, $Q(1) = 0.5$. In this case, $D(P \| Q) = 0.49 \log 0.98 + 0.51 \log 1.02 \approx 0.000289$ and $D(Q \| P) = 0.5(-\log 0.98 -
\log 1.02) \approx 0.000289$. Now the average support per trial in favor of $Q$ under distribution $Q$ is about equal to the average
support per trial in favor of $P$ under $P$.

4) Note that the KL divergences for the 'near uniform' distributions with $P(1), Q(1) \approx 0.5$ are much smaller than the
divergences for the skewed distributions with $P(1), Q(1) \approx 1$, while the total variation distance is the same for all these
distributions.
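The numbers in the list above can be reproduced with a few lines of code. This is a minimal sketch: the helper names are assumptions, logarithms are to base 2, and total variation is computed without a factor 1/2, matching the definition used in this section.

```python
import math

def kl_bits(q, p):
    """D(q || p) in bits; infinite if q has mass on an outcome where p has none."""
    total = 0.0
    for z, qz in q.items():
        if qz == 0:
            continue
        if p[z] == 0:
            return math.inf
        total += qz * math.log2(qz / p[z])
    return total

def total_variation(q, p):
    """Sum over outcomes of |P(z) - Q(z)|, as defined in the text."""
    return sum(abs(p[z] - q[z]) for z in p)

# Skewed pair from items 1) and 2).
P_skewed = {1: 0.99, 2: 0.01}
Q_skewed = {1: 1.0, 2: 0.0}
# Near-uniform pair from item 3).
P_flat = {1: 0.49, 2: 0.51}
Q_flat = {1: 0.5, 2: 0.5}

for name, p, q in [("skewed", P_skewed, Q_skewed), ("near uniform", P_flat, Q_flat)]:
    print(name, "TV:", total_variation(q, p),
          "D(P||Q):", kl_bits(p, q), "D(Q||P):", kl_bits(q, p))
```

Both pairs have the same total variation distance, while their KL divergences differ by orders of magnitude and, in the skewed case, depend strongly on the direction.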

The example stresses the asymmetry of KL divergence as well as its difference from the absolute deviations between
probabilities.



D. Intuition behind it all 

Here we give some intuition on the relation between KL divergence and statistical strength. It can be read without any
statistical background. Let $Z_1, Z_2, \ldots$ be a sequence of random variables independently generated either by some distribution
$P$ or by some distribution $Q$ with $Q \neq P$. Suppose we are given a sample (sequence of outcomes) $z_1, \ldots, z_n$. Perhaps the simplest
(though by no means only) way of finding out whether $Q$ or $P$ generated this data is to compare the likelihood (in our case,
'likelihood' = 'probability') of the data $z_1, \ldots, z_n$ according to the two distributions. That is, we look at the ratio

$$\frac{Q(z_1, \ldots, z_n)}{P(z_1, \ldots, z_n)} = \frac{\prod_{i=1}^{n} Q(z_i)}{\prod_{i=1}^{n} P(z_i)}. \tag{57}$$

Intuitively, if this ratio is larger than 1, the data is more typical for $Q$ than for $P$, and we might decide that $Q$ rather than $P$
generated the data. Again intuitively, the magnitude of the ratio in Equation (57) might give us an idea of the confidence we
should have in this decision.

Now assume that the data are actually generated according to $Q$, i.e. '$Q$ is true'. We will study the behavior of the logarithm
of the likelihood ratio in Equation (57) under this assumption (the use of the logarithm is only to simplify the analysis; using
Equation (57) directly would have led to the same conclusions). The Law of Large Numbers [11] tells us that, with $Q$-probability
1, averages of bounded random variables will converge to their $Q$-expectations. In particular, if the $Z_i$ take values in a finite
set $\mathcal{Z}$, and $P$ and $Q$ are such that $P(z), Q(z) > 0$ for all $z \in \mathcal{Z}$, then with $Q$-probability 1,

$$\frac{1}{n} \sum_{i=1}^{n} L_i \to E_Q[L], \tag{58}$$

where $L_i := \log(Q(Z_i)/P(Z_i))$, and $E_Q[L] = E_Q[L_1] = \cdots = E_Q[L_n]$ is given by

$$E_Q[L] = E_Q\left[\log \frac{Q(Z)}{P(Z)}\right] = \sum_{z} Q(z) \log \frac{Q(z)}{P(z)} = D(Q \| P). \tag{59}$$

Therefore, with $Q$-probability 1,

$$\frac{1}{n} \log \frac{Q(Z_1, \ldots, Z_n)}{P(Z_1, \ldots, Z_n)} \to D(Q \| P). \tag{60}$$

Thus, with $Q$-probability 1, the average log-likelihood ratio of $Q$ to $P$ will converge to the KL divergence $D(Q \| P)$.
This means that the likelihood ratio, which may be viewed as the amount of evidence for $Q$ vs. $P$, is asymptotically
determined by the KL divergence, to first order in the exponent. For example, let us test $Q$ first against $P_1$ with $D(Q \| P_1) = \epsilon_1$,
and then against $P_2$ with $D(Q \| P_2) = \epsilon_2 > \epsilon_1$, then, with $Q$-probability 1,

$$\frac{1}{n} \log \frac{Q(Z_1, \ldots, Z_n)}{P_1(Z_1, \ldots, Z_n)} \to \epsilon_1 \quad \text{and} \quad \frac{1}{n} \log \frac{Q(Z_1, \ldots, Z_n)}{P_2(Z_1, \ldots, Z_n)} \to \epsilon_2. \tag{61}$$

This implies that with increasing $n$, the likelihood ratio $Q/P_1$ becomes exponentially smaller than the likelihood ratio $Q/P_2$:

$$\frac{Q(Z_1, \ldots, Z_n)}{P_1(Z_1, \ldots, Z_n)} \leq \frac{Q(Z_1, \ldots, Z_n)}{P_2(Z_1, \ldots, Z_n)} \cdot 2^{-n(\epsilon_2 - \epsilon_1) + o(n)}. \tag{62}$$

Returning to the setting discussed in this paper, this preliminary analysis suggests that from QM's point of view (who knows
that $Q$ is true), the most convincing experimental results (highest likelihood ratio of $Q_\sigma$ vs. $P_{\sigma;\pi}$) are obtained if the KL
divergence between $Q_\sigma$ and $P_{\sigma;\pi}$ is as large as possible. If $Q_\sigma$ is compared against a set $\mathcal{P}_\sigma$, then the analysis suggests that
the most convincing experimental results are obtained if the KL divergence between $Q_\sigma$ and $\mathcal{P}_\sigma$ is as large as possible, that
is, if $\inf_{P_\sigma \in \mathcal{P}_\sigma} D(Q_\sigma \| P_\sigma)$ is as large as possible.

E. Bayesian Analysis 

In this appendix we assume some basic knowledge of Bayesian statistics. We only give the derivation for the 2×2×2
nonlocality proofs. Extension to generalized nonlocality proofs is straightforward.

Let us identify $H_1 := Q_\sigma$ and $H_0 := \mathcal{P}_\sigma$, where $Q_\sigma$ and $\mathcal{P}_\sigma$ are defined as quantum and local realist theories respectively,
as in Section II. We start with a prior Pr on $H_1$ and $H_0$, and we assume $0 < \Pr(H_1) < 1$.

Now, conditioned on $H_0$ being the case, the actual distribution generating the data may still be any $P_{\sigma;\pi} \in H_0$. To indicate
the prior degree of belief in these, we further need a conditional prior distribution $\Pr(\cdot \mid H_0)$ over all the distributions in $H_0$.
Since $\sigma$ is fixed, $H_0$ is parameterized by the set $\Pi$. We suppose the prior $\Pr(\cdot \mid H_0)$ is smooth in the sense of having a
probability density function $w$ over $\Pi$, so that for each (measurable) $A \subset \Pi$,

$$\Pr(\{P_{\sigma;\pi} : \pi \in A\} \mid H_0) := \int_{\pi \in A} w(\pi)\, d\pi. \tag{63}$$

We restrict attention to prior densities $w(\cdot)$ that are continuous and uniformly bounded away from 0. By the latter we mean
that there exists $w_{\min} > 0$ such that $w(\pi) > w_{\min}$ for all $\pi \in \Pi$. For concreteness one may take $w$ to be uniform (constant over
$\pi$), although this will not affect the analysis.

In order to apply Bayesian inference, we further have to define $\Pr(z_1, \ldots, z_n \mid H_i)$, 'the probability of the data given that $H_i$
is true'. We do this in the standard Bayesian manner:

$$\Pr(z_1, \ldots, z_n \mid H_1) := Q_\sigma(z_1, \ldots, z_n),$$

$$\Pr(z_1, \ldots, z_n \mid H_0) := \int_{\pi \in \Pi} P_{\sigma;\pi}(z_1, \ldots, z_n)\, w(\pi)\, d\pi. \tag{64}$$

Here each outcome $z_i$ consists of a realized measurement setting and an experimental outcome in that setting; hence we can
write $z_i = (a_i, b_i, x_i, y_i)$ for $a_i, b_i \in \{1, 2\}$ and $x_i, y_i \in \{T, F\}$.

Together with the prior over $\{H_1, H_0\}$, Equation (64) defines a probability distribution over the product space $\{H_1, H_0\} \times \mathcal{Z}^n$
where $\mathcal{Z} := \{1, 2\} \times \{1, 2\} \times \{T, F\} \times \{T, F\}$. Given experimental data $z_1, \ldots, z_n$ and prior distribution Pr, we can now use
Bayes' rule [11], [23] to compute the posterior distribution of $H_i$:

$$\Pr(H_i \mid z_1, \ldots, z_n) = \frac{\Pr(z_1, \ldots, z_n \mid H_i) \Pr(H_i)}{\sum_{i'} \Pr(z_1, \ldots, z_n \mid H_{i'}) \Pr(H_{i'})}. \tag{65}$$

According to Bayesian hypothesis testing, we should select the $H_i$ maximizing the posterior probability of Equation (65). The
confidence in the decision, which we denote by post-odds, is given by the posterior odds against $H_0$:

$$\text{post-odds} := \frac{\Pr(H_1 \mid z_1, \ldots, z_n)}{\Pr(H_0 \mid z_1, \ldots, z_n)} \tag{66}$$

$$= \frac{\Pr(z_1, \ldots, z_n \mid H_1) \Pr(H_1)}{\Pr(z_1, \ldots, z_n \mid H_0) \Pr(H_0)} \tag{67}$$

$$= \frac{Q_\sigma(z_1, \ldots, z_n)}{\int_{\pi \in \Pi} P_{\sigma;\pi}(z_1, \ldots, z_n)\, w(\pi)\, d\pi} \cdot \frac{\Pr(H_1)}{\Pr(H_0)}. \tag{68}$$

Note that post-odds depends on $H_0$, $H_1$ and the data $z_1, \ldots, z_n$. The factor on the left in the last line of Equations (66)-(68) is called the Bayes
factor, and the factor on the right is called the prior odds. Since the Bayes factor typically increases exponentially with $n$, the
influence of the prior odds on the posterior odds is negligible for all but the smallest $n$. Below we show that, if $H_1$ is true
('QM is right'), then with probability 1,

$$\frac{1}{n} \log(\text{post-odds}) \to \inf_{\Pi} D(Q_\sigma \| P_{\sigma;\pi}). \tag{69}$$
Hence Equation (46) holds: the confidence post-odds will be determined, to first order in the exponent, by $\inf_{\Pi} D(Q_\sigma \| P_{\sigma;\pi})$.
This gives a Bayesian justification of adopting $D(Q_\sigma \| \mathcal{P}_\sigma)$ as an indicator of statistical strength, provided that we can prove
that Equation (69) holds. We proceed to show this.
Proof: To prove Equation (69), we first note that

$$\begin{aligned}
\log P_{\sigma;\pi}(z_1, \ldots, z_n) &= \log P_{\sigma;\pi}((a_1, b_1, x_1, y_1), \ldots, (a_n, b_n, x_n, y_n)) && (70) \\
&= n \cdot \sum_{a,b \in \{1,2\}} \hat{P}(a,b) \log \sigma_{ab} \; + && (71) \\
&\quad\; n \cdot \sum_{\substack{a,b \in \{1,2\} \\ x,y \in \{T,F\}}} \hat{P}(a,b,x,y) \log \Bigl( \sum_{\substack{x_1, x_2, y_1, y_2 \\ x_a = x,\, y_b = y}} \pi_{x_1 x_2 y_1 y_2} \Bigr). && (72)
\end{aligned}$$

Here $\hat{P}(a,b)$ is the relative frequency (number of occurrences in the sample divided by $n$) of experimental outcomes with
measurement setting $(a,b)$ and $\hat{P}(a,b,x,y)$ is the relative frequency of experimental outcomes with measurement setting $(a,b)$
and outcome $X = x$, $Y = y$.

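Let $\tilde{\pi}$ be any $\pi$ achieving $\inf_{\Pi} D(Q_\sigma \| P_{\sigma;\pi})$. By Theorem 1, such a $\tilde{\pi}$ must exist, and $Q_\sigma$ must be absolutely continuous
with respect to $P_{\sigma;\tilde{\pi}}$. It follows that

$$P_{\sigma;\tilde{\pi}}(a,b,x,y) = \sigma_{ab} \sum_{\substack{x_1, x_2, y_1, y_2 \\ x_a = x,\, y_b = y}} \tilde{\pi}_{x_1 x_2 y_1 y_2} \tag{73}$$

may be equal to 0 only if $\sigma_{ab} Q_{ab}(x,y) = 0$. From this it follows (with some calculus) that there must be a constant $c$ and
an $\epsilon > 0$ such that if $|\pi - \tilde{\pi}|_1 < \epsilon$, then, for all $n$ and all sequences $z_1, \ldots, z_n$ with $Q_\sigma(z_1, \ldots, z_n) > 0$, we have

$$\frac{\left| \frac{1}{n} \log P_{\sigma;\pi}(z_1, \ldots, z_n) - \frac{1}{n} \log P_{\sigma;\tilde{\pi}}(z_1, \ldots, z_n) \right|}{|\pi - \tilde{\pi}|_1} \leq c \tag{74}$$

and whence $|\log P_{\sigma;\pi}(z_1, \ldots, z_n) - \log P_{\sigma;\tilde{\pi}}(z_1, \ldots, z_n)| \leq nc|\pi - \tilde{\pi}|_1 < nc\epsilon$. For sufficiently large $n$, we find that $\epsilon > n^{-2}$ and
then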

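$$\begin{aligned}
\Pr(z_1, \ldots, z_n \mid H_0) &= \int_{\pi \in \Pi} P_{\sigma;\pi}(z_1, \ldots, z_n)\, w(\pi)\, d\pi && (75) \\
&\geq \int_{|\pi - \tilde{\pi}|_1 < n^{-2}} P_{\sigma;\pi}(z_1, \ldots, z_n)\, w(\pi)\, d\pi && (76) \\
&\geq w_{\min} \cdot v n^{-2k} \cdot e^{-c/n}\, P_{\sigma;\tilde{\pi}}(z_1, \ldots, z_n), && (77)
\end{aligned}$$

where $v \cdot n^{-2k}$ is a lower bound on the volume $\int 1\, d\pi$ of the set $\{\pi : |\pi - \tilde{\pi}|_1 < \epsilon = n^{-2}\}$. Hence, $-\log \Pr(z_1, \ldots, z_n \mid H_0) \leq
-\log P_{\sigma;\tilde{\pi}}(z_1, \ldots, z_n) + O(\log n)$. By applying the strong law of large numbers to $n^{-1} \sum_i \log P_{\sigma;\tilde{\pi}}(Z_i)$, we find that, with
$Q_\sigma$-probability 1,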

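$$\begin{aligned}
-\frac{1}{n} \log \Pr(Z_1, \ldots, Z_n \mid H_0) &\leq E_{Q_\sigma}[-\log P_{\sigma;\tilde{\pi}}(Z)] + O\Bigl(\frac{\log n}{n}\Bigr) && (78) \\
&= \inf_{\Pi} E_{Q_\sigma}[-\log P_{\sigma;\pi}(Z)] + O\Bigl(\frac{\log n}{n}\Bigr). && (79)
\end{aligned}$$

This bounds $-\frac{1}{n} \log \Pr(Z^n \mid H_0)$ from above. We proceed to bound it from below. Note that for all $n$, $z_1, \ldots, z_n$,

$$-\frac{1}{n} \log \int_{\Pi} P_{\sigma;\pi}(z_1, \ldots, z_n)\, w(\pi)\, d\pi \geq \inf_{\Pi} -\frac{1}{n} \log P_{\sigma;\pi}(z_1, \ldots, z_n). \tag{80}$$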

To complete the proof, we need to relate

$$\inf_{\Pi} -\frac{1}{n} \log P_{\sigma;\pi}(z_1, \ldots, z_n) = \inf_{\Pi} -\frac{1}{n} \sum_{i=1}^{n} \log P_{\sigma;\pi}(z_i) \tag{81}$$

(which depends on the data) to its 'expectation version' $\inf_{\Pi} E_{Q_\sigma}[-\log P_{\sigma;\pi}(Z)]$. This can be done using a version of the
uniform law of large numbers [34]. Based on such a uniform law of large numbers (for example, [18, Chapter 5, Lemma
5.14]), one can show that for all distributions $Q$ over $\mathcal{Z}$, with $Q$-probability 1, as $n \to \infty$,

$$\inf_{\Pi} -\frac{1}{n} \log P_{\sigma;\pi}(Z_1, \ldots, Z_n) \to \inf_{\Pi} E_Q[-\log P_{\sigma;\pi}(Z)]. \tag{82}$$

Together, Equations (78), (80) and (82) show that, with $Q_\sigma$-probability 1, as $n \to \infty$,

$$-\frac{1}{n} \log \Pr(Z_1, \ldots, Z_n \mid H_0) \to \inf_{\Pi} E_{Q_\sigma}[-\log P_{\sigma;\pi}(Z)]. \tag{83}$$
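TABLE XXX
Generic Classical Distribution

Pr_ab(X = x, Y = y) |      a = 1       |      a = 2
                    |  x = T    x = F  |  x = T    x = F
--------------------+------------------+-----------------
b = 1     y = T     |  p1       p2     |  p5       p6
          y = F     |  p3       p4     |  p7       p8
--------------------+------------------+-----------------
b = 2     y = T     |  p9       p10    |
          y = F     |  p11      p12    |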


Together with the law of large numbers applied to $n^{-1} \sum_i \log \Pr(Z_i \mid H_1)$, we find that

$$\frac{1}{n} \log \frac{\Pr(Z_1, \ldots, Z_n \mid H_1)}{\Pr(Z_1, \ldots, Z_n \mid H_0)} \to \inf_{\Pi} E_{Q_\sigma}\Bigl[\log \frac{Q_\sigma(Z)}{P_{\sigma;\pi}(Z)}\Bigr]. \tag{84}$$

Noting that the right hand side is equal to $\inf_{\Pi} D(Q_\sigma \| P_{\sigma;\pi})$ and plugging this into Equation (66), we see that indeed with
$H_1$-probability 1, Equation (69) holds. □
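The statement of Equation (69) can be illustrated on a toy problem that is much simpler than the nonlocality setting. The sketch below rests on several assumptions that are not in the original: a fair coin plays the role of $Q_\sigma$, the composite hypothesis $H_0$ is the hypothetical family of coins with bias $\theta \in [0.6, 0.9]$, the prior on $H_0$ is uniform on a finite grid rather than a continuous density, and the prior odds are 1. The code compares $\frac{1}{n} \log_2(\text{post-odds})$ with the smallest KL divergence from $Q$ to the family.

```python
import math
import random

def log2_marginal_h0(k, n, thetas):
    """log2 of the marginal likelihood of k heads in n tosses under a uniform grid prior."""
    logs = [k * math.log2(t) + (n - k) * math.log2(1 - t) for t in thetas]
    m = max(logs)
    # log-sum-exp in base 2, to avoid underflow of the tiny likelihood values
    return m + math.log2(sum(2.0 ** (value - m) for value in logs) / len(logs))

def kl_bernoulli_bits(q, p):
    """D(Bernoulli(q) || Bernoulli(p)) in bits, for 0 < q, p < 1."""
    return q * math.log2(q / p) + (1 - q) * math.log2((1 - q) / (1 - p))

rng = random.Random(1)  # fixed seed for reproducibility
q = 0.5                                               # toy stand-in for Q_sigma
thetas = [0.6 + 0.3 * i / 300 for i in range(301)]    # toy stand-in for H_0
n = 200_000
k = sum(rng.random() < q for _ in range(n))           # number of heads under Q

log2_q = k * math.log2(q) + (n - k) * math.log2(1 - q)
log2_post_odds = log2_q - log2_marginal_h0(k, n, thetas)  # prior odds taken to be 1
rate = log2_post_odds / n
target = min(kl_bernoulli_bits(q, t) for t in thetas)

print("(1/n) log2 post-odds:", rate)
print("inf D(Q || P_theta) :", target)
```

The remaining gap between the two printed values reflects the $O((\log n)/n)$ term and the sampling fluctuation of the observed frequency.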



F. Information Theoretic Analysis 

In this appendix we assume that the reader is familiar with the basics of information theory. 

The code with lengths $L_{\mathcal{P}_\sigma}$ is simply the Shannon-Fano code for the Bayesian marginal likelihood $\Pr(z_1, \ldots, z_n \mid \mathcal{P}_\sigma) =
\Pr(z_1, \ldots, z_n \mid H_0)$ as defined in Equation (64). For each $n$, each $z_1, \ldots, z_n \in \mathcal{Z}^n$, this code achieves lengths (up to 1 bit)
$-\log \Pr(z_1, \ldots, z_n \mid \mathcal{P}_\sigma)$. The code corresponding to $Q_\sigma$ achieves lengths $-\log Q_\sigma(z_1, \ldots, z_n)$. We have already shown in
Appendix IV-E, Equation (84), that, with $Q_\sigma$-probability 1, as $n \to \infty$,

$$\frac{1}{n} \bigl[ -\log \Pr(z_1, \ldots, z_n \mid \mathcal{P}_\sigma) - [-\log Q_\sigma(z_1, \ldots, z_n)] \bigr] \to D(Q_\sigma \| \mathcal{P}_\sigma). \tag{85}$$

Noting that the left hand side is equal to bit-diff/$n$, Equation (53) follows.



Appendix V 
Proofs of Theorems 1, 2 and 3

A. Preparation 

The proof of Theorem 1 uses the following lemma, which is of some independent interest.

Lemma 1: Let $(\Sigma, \Pi, U)$ be the game corresponding to an arbitrary 2 party, 2 measurement settings per party nonlocality
proof. For any $(a_0, b_0) \in \{1, 2\}^2$, there exists a $\pi \in \Pi$ such that for all $(a,b) \in \{1,2\}^2 \setminus \{(a_0, b_0)\}$ we have $Q_{ab} = P_{ab;\pi}$. Thus,
for any three of the four measurement settings, the probability distribution on outcomes can be perfectly explained by a local
realist theory.

Proof: We give a detailed proof for the case that the measurement outcomes are two values $\{T, F\}$; the general case can
be proved in a similar way.

Without loss of generality let $(a_0, b_0) = (2,2)$. Now we must prove that the equation $Q_{ab} = P_{ab;\pi}$ holds for the three settings
$(a,b) \in \{(1,1), (1,2), (2,1)\}$. Every triple of distributions $P_{ab;\pi}$ for these three settings may be represented by a table of the
form of Table XXX with $p_1, \ldots, p_{12} \geq 0$ and the normalization restrictions $p_1 + \cdots + p_4 = p_5 + \cdots + p_8 = p_9 + \cdots + p_{12} = 1$.
Given any table of this form, we say that the LR distribution $P_\pi$ corresponds to the p-table if $P_{11;\pi}(T,T) = p_1$, $P_{12;\pi}(F,T) = p_{10}$
etc., for all $p_i$.

The no-signalling restriction implies that the realized measurement setting on A's side should not influence the probability
on B's side and vice versa. Hence, for example, $\Pr_{11}(Y = T) = \Pr_{21}(Y = T)$, which gives $p_1 + p_2 = p_5 + p_6$. In total there are
four such no-signalling restrictions:

$$\begin{cases}
p_1 + p_2 = p_5 + p_6 \\
p_3 + p_4 = p_7 + p_8 \\
p_1 + p_3 = p_9 + p_{11} \\
p_2 + p_4 = p_{10} + p_{12}.
\end{cases} \tag{86}$$

We call a table with $p_1, \ldots, p_{12} \geq 0$ that obeys the normalization restriction on the sub-tables and that satisfies Equations (86)
a $\Gamma$-table. We already showed that each triple of conditional LR distributions may be represented as a $\Gamma$-table. In exactly the
same way one shows that each triple of conditional quantum experimentalist distributions $Q_{11}$, $Q_{12}$, $Q_{21}$ can be represented as a
$\Gamma$-table. It therefore suffices if we can show that every $\Gamma$-table corresponds to some LR theory $P_\pi$. We show this by considering
the 16 possible deterministic theories $T_{x_1 x_2 y_1 y_2}$. Here $T_{x_1 x_2 y_1 y_2}$ is defined as the theory with $P_\pi(X_1 = x_1, X_2 = x_2, Y_1 = y_1, Y_2 =
y_2) = \pi_{x_1 x_2 y_1 y_2} = 1$. Each deterministic theory $T_{x_1 x_2 y_1 y_2}$ corresponds to a specific $\Gamma$-table denoted by $\Gamma_{x_1 x_2 y_1 y_2}$. For example, the
theory $T_{FFTF}$ gives the $\Gamma_{FFTF}$-table shown in Table XXXI. We will prove that the set of $\Gamma$-tables is in fact the convex hull of
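TABLE XXXI
Example of a Classical, Deterministic Distribution

Pr_ab(X = x, Y = y) |      a = 1       |      a = 2
                    |  x = T    x = F  |  x = T    x = F
--------------------+------------------+-----------------
b = 1     y = T     |  0        1      |  0        1
          y = F     |  0        0      |  0        0
--------------------+------------------+-----------------
b = 2     y = T     |  0        0      |
          y = F     |  0        1      |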

the 16 tables $\Gamma_{x_1 x_2 y_1 y_2}$ corresponding to deterministic theories. This shows that any $\Gamma$-table can be reproduced by a mixture of
deterministic theories. Since every LR theory $\pi \in \Pi$ can be written as such a mixture, this proves the lemma.

First we observe that a $\Gamma$-table with all entries 0 or 1 has to be one of the 16 deterministic theories. Given a $\Gamma$-table that is
not a deterministic theory, we focus on its smallest nonzero entry $\Gamma_{ab} = \epsilon > 0$. By the restrictions of Equations (86) there exists
a deterministic theory $T_k$ such that the table $(\Gamma - \epsilon \Gamma_k)/(1 - \epsilon)$ has no negative entries. For example, suppose that the smallest
element in $\Gamma$ corresponds to $P_\pi(X_1 = F, Y_1 = T)$ (denoted as $p_2$ in the first table above). By the restrictions of Equation (86),
either the table $(\Gamma - p_2 \Gamma_{FFTF})/(1 - p_2)$ (where $\Gamma_{FFTF}$ is shown above) or one of the three tables $(\Gamma - p_2 \Gamma_{FFTT})/(1 - p_2)$,
$(\Gamma - p_2 \Gamma_{FTTF})/(1 - p_2)$, $(\Gamma - p_2 \Gamma_{FTTT})/(1 - p_2)$ has only nonnegative entries.

Let $\Gamma' := (\Gamma - \epsilon \Gamma_k)/(1 - \epsilon)$ where $k$ is chosen such that $\Gamma'$ has no negative entries. Clearly, either $\Gamma'$ describes a deterministic
theory with entries 0 and 1, or $\Gamma'$ is a $\Gamma$-table with number of nonzero entries one less than that of $\Gamma$. Hence by applying the
above procedure at most 16 times, we obtain a decomposition $\Gamma = \epsilon_1 \Gamma_{k_1} + \cdots + \epsilon_{16} \Gamma_{k_{16}}$, which shows that $\Gamma$ lies in the convex
hull of the $\Gamma$-tables corresponding to deterministic theories. Hence, any such $\Gamma$ can be described as a LR theory.

For measurement settings with more than two outcomes, the proof can be generalized in a straightforward manner. □ 
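The easy direction used in the proof, namely that every local realist theory yields a table satisfying the normalization and no-signalling restrictions of Equation (86), can be checked mechanically. The following sketch is an illustration only: the function names and the random mixture are assumptions, and the entry order $p_1, \ldots, p_{12}$ follows the layout of Table XXX. It does not implement the decomposition argument itself.

```python
import random
from itertools import product

# Settings (a, b) in the order of the sub-tables p1..p4, p5..p8, p9..p12 of Table XXX.
SETTINGS = [(1, 1), (2, 1), (1, 2)]
# Outcomes (x, y) in the order of the entries within each sub-table.
OUTCOMES = [("T", "T"), ("F", "T"), ("T", "F"), ("F", "F")]

def gamma_table(pi):
    """Map an LR theory pi, a dict over (x1, x2, y1, y2), to the entries [p1, ..., p12]."""
    table = []
    for a, b in SETTINGS:
        for x, y in OUTCOMES:
            table.append(sum(w for (x1, x2, y1, y2), w in pi.items()
                             if (x1, x2)[a - 1] == x and (y1, y2)[b - 1] == y))
    return table

def is_gamma_table(p, tol=1e-9):
    """Check nonnegativity, normalization of the sub-tables, and Equation (86)."""
    p1, p2, p3, p4, p5, p6, p7, p8, p9, p10, p11, p12 = p
    normalized = all(abs(sum(p[i:i + 4]) - 1.0) < tol for i in (0, 4, 8))
    no_signalling = (abs((p1 + p2) - (p5 + p6)) < tol and
                     abs((p3 + p4) - (p7 + p8)) < tol and
                     abs((p1 + p3) - (p9 + p11)) < tol and
                     abs((p2 + p4) - (p10 + p12)) < tol)
    return min(p) >= -tol and normalized and no_signalling

# The 16 deterministic theories, each putting all mass on one (x1, x2, y1, y2).
deterministic = [{xs: 1.0} for xs in product("TF", repeat=4)]

# A hypothetical random mixture of the deterministic theories.
rng = random.Random(0)
weights = [rng.random() for _ in deterministic]
mixture = {next(iter(t)): w / sum(weights) for t, w in zip(deterministic, weights)}

print(gamma_table({("F", "F", "T", "F"): 1.0}))  # the table of T_FFTF, as in Table XXXI
print(all(is_gamma_table(gamma_table(t)) for t in deterministic))
print(is_gamma_table(gamma_table(mixture)))
```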



B. Proof of Theorem 1

Theorem 1: Let $Q$ be a given (not necessarily 2×2×2) nonlocality proof and $\Pi$ the corresponding set of local realist
theories.

1) Let $U(\sigma, \pi) := D(Q_\sigma \| P_{\sigma;\pi})$, then:

a) For a 2×2×2 proof, we have that

$$U(\sigma, \pi) = \sum_{a,b \in \{1,2\}} \sigma_{ab}\, D(Q_{ab}(\cdot) \| P_{ab;\pi}(\cdot)). \tag{87}$$

Hence, the KL divergence $D(Q_\sigma \| P_{\sigma;\pi})$ may alternatively be viewed as the average KL divergence between the
distributions of $(X, Y)$, where the average is over the settings $(A, B)$. For a generalized nonlocality proof, the
analogous generalization of Equation (87) holds.

b) For fixed $\sigma$, $U(\sigma, \pi)$ is convex and lower semicontinuous on $\Pi$, and continuous and differentiable on the interior
of $\Pi$.

c) If $Q$ is absolutely continuous with respect to some fixed $\pi$, then $U(\sigma, \pi)$ is linear in $\sigma$.



2) Let

$$U(\sigma) := \inf_{\Pi} U(\sigma, \pi), \tag{88}$$

then

a) For all $\sigma \in \Sigma$, the infimum in Equation (88) is achieved for some $\pi^*$.

b) The function $U(\sigma)$ is nonnegative, bounded, concave and continuous in $\sigma$.

c) If $Q$ is not a proper nonlocality proof, then $U(\sigma) = 0$ for all $\sigma \in \Sigma$. If $Q$ is a proper nonlocality proof, then $U(\sigma) > 0$
for all $\sigma$ in the interior of $\Sigma$.

d) For a 2 party, 2 measurement settings per party nonlocality proof, we further have that, even if $Q$ is proper, then
still $U(\sigma) = 0$ for all $\sigma$ on the boundary of $\Sigma$.

3) Suppose that $\sigma$ is in the interior of $\Sigma$, then:

a) Let $Q$ be a 2×2×2 nonlocality proof. Suppose that $Q$ is non-trivial in the sense that, for some $a,b$, $Q_{ab}$ is not a
point mass (i.e. $0 < Q_{ab}(x,y) < 1$ for some $x,y$). Then $\pi^* \in \Pi$ achieves the infimum in Equation (88) if and only
if the following 16 (in)equalities hold:

$$\sum_{a,b \in \{1,2\}} \sigma_{ab} \frac{Q_{ab}(x_a, y_b)}{P_{ab;\pi^*}(x_a, y_b)} = 1 \tag{89}$$

for all $(x_1, x_2, y_1, y_2) \in \{T, F\}^4$ such that $\pi^*_{x_1 x_2 y_1 y_2} > 0$, and

$$\sum_{a,b \in \{1,2\}} \sigma_{ab} \frac{Q_{ab}(x_a, y_b)}{P_{ab;\pi^*}(x_a, y_b)} \leq 1 \tag{90}$$

for all $(x_1, x_2, y_1, y_2) \in \{T, F\}^4$ such that $\pi^*_{x_1 x_2 y_1 y_2} = 0$.

For generalized nonlocality proofs, $\pi^* \in \Pi$ achieves Equation (88) if and only if the corresponding analogues of
Equations (89) and (90) both hold.

b) Suppose that $\pi^*$ and $\pi^\circ$ both achieve the infimum in Equation (88). Then, for all $x,y \in \{T, F\}$, $a,b \in \{1,2\}$ with
$Q_{ab}(x,y) > 0$, we have $P_{ab;\pi^*}(x,y) = P_{ab;\pi^\circ}(x,y) > 0$. In words, $\pi^*$ and $\pi^\circ$ coincide in every measurement setting
for every measurement outcome that has positive probability according to $Q_\sigma$, and $Q$ is absolutely continuous with
respect to $\pi^*$ and $\pi^\circ$.

Proof: We only give proofs for the 2×2×2 case; extension to the general case is entirely straightforward. We define

$$\begin{aligned}
U((a,b), \pi) &:= D(Q_{ab}(\cdot) \| P_{ab;\pi}(\cdot)) && (91) \\
&= \sum_{\substack{x,y \in \{T,F\} \\ Q_{ab}(x,y) > 0}} Q_{ab}(x,y) \left[ \log Q_{ab}(x,y) - \log P_{ab;\pi}(x,y) \right]. && (92)
\end{aligned}$$

Note that $U(\sigma, \pi)$ can be written as $U(\sigma, \pi) = \sum_{a,b \in \{1,2\}} \sigma_{ab}\, U((a,b), \pi)$.

Part 1: Equation (87) follows directly from the additivity property of KL divergence, Equation (55). Convexity is immediate
by Jensen's inequality applied to the logarithm in Equation (92) and the fact that $P_{ab;\pi}(x,y)$ is linear in $\pi_{x_1 x_2 y_1 y_2}$ for each
$(x_1, x_2, y_1, y_2) \in \{T, F\}^4$. If $\pi$ lies in the interior of $\Pi$, then $P_{ab;\pi}(x,y) > 0$ for $a,b \in \{1,2\}$ so that $U(\sigma, \pi)$ is finite. Continuity
and differentiability are then immediate by continuity and differentiability of $\log x$ for $x > 0$. Lower semicontinuity of $U(\sigma, \pi)$
on $\Pi$ is implied by the fact that, on general spaces, $D(Q \| P)$ is jointly lower semi-continuous in $Q$ and $P$ in the weak topology,
as proved by Posner [29, Theorem 2]. Part 1(c) is immediate.

Part 2: We have already shown that for fixed $\sigma$, $U(\sigma, \pi)$ is lower semicontinuous on $\Pi$. Lower semicontinuous functions
achieve their infimum on a compact domain (see for example [13, page 84]), so that for each $\sigma$, Equation (88) is achieved for
some $\pi^*$. This proves (a). To prove (b), note that nonnegativity of $U(\sigma)$ is immediate by nonnegativity of the KL divergence.
Boundedness of $U(\sigma)$ follows by considering the uniform distribution $\pi^\circ$, with, for all $x_1, x_2, y_1, y_2$, $\pi^\circ_{x_1 x_2 y_1 y_2} = 1/16$. $\pi^\circ$ is in
$\Pi$, so that

$$\begin{aligned}
U(\sigma) &\leq U(\sigma, \pi^\circ) && (93) \\
&= \sum_{a,b \in \{1,2\}} \sigma_{ab} \Bigl( \sum_{\substack{x,y \in \{T,F\} \\ Q_{ab}(x,y) > 0}} Q_{ab}(x,y) \left[ \log Q_{ab}(x,y) + 2 \right] \Bigr) && (94) \\
&\leq -\sum_{a,b \in \{1,2\}} \sigma_{ab} H(Q_{ab}) + 8, && (95)
\end{aligned}$$

where $H(Q_{ab})$ is the Shannon-entropy of the distribution $Q_{ab}$. Boundedness of $U(\sigma)$ now follows from the fact that $H(Q) \geq 0$
for every distribution $Q$, which is a standard result (see, e.g. [10]).

Let $\sigma$ be in the interior of $\Sigma$ and let $\pi^* \in \Pi$ achieve $\inf_{\Pi} U(\sigma, \pi)$. Since $U(\sigma)$ is bounded, $Q$ is absolutely continuous with
respect to $\pi^*$ (otherwise $U(\sigma) = \infty$, a contradiction). Thus, $U(\sigma)$ satisfies

$$U(\sigma) = \inf_{\pi : Q \ll \pi} U(\sigma, \pi), \tag{96}$$

where $Q \ll \pi$ means that $Q$ is absolutely continuous with respect to $\pi \in \Pi$. We already proved that if $Q$ is absolutely continuous
with respect to $\pi^*$, then $U(\sigma, \pi^*)$ is linear in $\sigma$. Thus, by Equation (96), $U(\sigma)$ is an infimum of linear functions, which (by
a standard result of convex analysis, see e.g. [32]) is concave. A concave and bounded function with a convex domain must
be continuous on the interior of this domain (see, e.g., [32]). It remains to show that $U(\sigma)$ is continuous at boundary points
of $\Sigma$. Showing this is straightforward by taking limits (but tedious). We omit the details.

Now for part (c). If $Q$ is not a proper nonlocality proof, then by definition there exists a $\pi_0 \in \Pi$ such that, for $a,b \in \{1,2\}$, we have $Q_{ab} = P_{ab;\pi_0}$ and hence $U(\sigma,\pi_0) = 0$ for all $\sigma \in \Sigma$.

Now suppose $Q$ is a proper nonlocality proof. Let $\sigma$ be in the interior of $\Sigma$. $\inf_{\Pi} U(\sigma,\pi)$ is achieved for some $\pi^*$. Suppose, by means of contradiction, that $U(\sigma,\pi^*) = 0$. Since $\sigma_{ab} > 0$ for $a,b \in \{1,2\}$, we must have $Q_{ab} = P_{ab;\pi^*}$ for $a,b \in \{1,2\}$. But then $Q$ is not a proper nonlocality proof; contradiction. For part (d), if $\sigma$ is on the boundary of $\Sigma$, then for some $a,b$, $\sigma_{ab} = 0$. It then follows from Lemma 1 and the fact that, for all $P$, $D(P\|P) = 0$ that $U(\sigma,\pi^*) = 0$.

Part 3: Part (a). The condition that $Q_{ab}$ is not a point mass for some $a,b$ implies that all $\pi^*$ that achieve the infimum must have $\pi^*_{x_1x_2y_1y_2} < 1$ for all $x_1,x_2,y_1,y_2$ (otherwise $U(\sigma,\pi^*) = \infty$, which is a contradiction). Thus, we assume that $\pi^* \in \Pi_0$, with $\Pi_0$ the set of $\pi$s that satisfy this "< 1" restriction.
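For $\rho \in [0,\infty)^{16}$, let

$$\pi_{x_1x_2y_1y_2}(\rho) := \frac{\rho_{x_1x_2y_1y_2}}{\sum_{x_1',x_2',y_1',y_2' \in \{T,F\}} \rho_{x_1'x_2'y_1'y_2'}}. \tag{97}$$

In this way, each vector $\rho$ with at least one non-zero component uniquely defines a local theory $\pi(\rho) \in \Pi$, and

$$\Bigl\{ \pi(\rho) : \rho \in [0,\infty)^{16} \text{ and } \sum_{x_1,x_2,y_1,y_2} \rho_{x_1x_2y_1y_2} > 0 \Bigr\} = \Pi. \tag{98}$$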



Let $\rho^*$ be such that $\pi(\rho^*)$ achieves the infimum in Equation (88). Then $Q$ is absolutely continuous with respect to $\pi(\rho^*)$. One can now show that for each $(x_1,x_2,y_1,y_2) \in \{T,F\}^4$, the partial derivative $\partial U(\sigma,\pi(\rho))/\partial \rho_{x_1x_2y_1y_2}$ evaluated at $\rho = \rho^*$ exists (even if $\rho^*_{x_1x_2y_1y_2} = 0$). Since $\rho^*$ achieves the infimum, it follows that, for each $(x_1,x_2,y_1,y_2) \in \{T,F\}^4$, we must have that $(\partial/\partial \rho_{x_1x_2y_1y_2})\,U(\sigma,\pi(\rho))$ evaluated at $\rho^*$ is not less than 0, or, equivalently,

«4 ■( E P m J>0 (99) 

with equality if $\rho^*_{x_1x_2y_1y_2} > 0$. Straightforward evaluation of Equation (99) gives Equations (89) and (90). This shows that each $\pi^*$ achieving the infimum in Equation (88) satisfies Equations (89) and (90). On the other hand, each $\pi^*$ corresponding to a $\rho^*$ with $\pi(\rho^*) = \pi^*$ such that Equation (99) holds for each $(x_1,x_2,y_1,y_2) \in \{T,F\}^4$ must achieve a local minimum of $U(\sigma,\pi)$ (viewed as a function of $\pi$). Since $U(\sigma,\pi)$ is convex in $\pi$, $\pi^*$ must achieve the infimum of Equation (88).
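As a sketch of what the "straightforward evaluation" involves, the derivative can be worked out directly from the definitions used in this proof, assuming $U(\sigma,\pi) = \sum_{a,b}\sigma_{ab} D(Q_{ab}\|P_{ab;\pi})$ with base-2 logarithms and $\pi(\rho)$ equal to $\rho$ normalized to sum to one. The shorthand $k = (x_1,x_2,y_1,y_2)$, $S$, and $x_a$, $y_b$ (the components of $k$ selected by settings $a$, $b$) is introduced here for illustration only and is not the paper's notation.

```latex
\begin{align*}
S &:= \sum_{k'} \rho_{k'}, \qquad
P_{ab;\pi(\rho)}(x,y) = \frac{1}{S} \sum_{k' :\, x'_a = x,\; y'_b = y} \rho_{k'}, \\
\frac{\partial}{\partial \rho_k} \bigl[ -\log P_{ab;\pi(\rho)}(x,y) \bigr]
  &= \frac{1}{S \ln 2} \left( 1 - \frac{\mathbf{1}[x_a = x,\; y_b = y]}{P_{ab;\pi(\rho)}(x,y)} \right), \\
\frac{\partial U(\sigma,\pi(\rho))}{\partial \rho_k}
  &= \sum_{a,b} \sigma_{ab} \sum_{x,y} Q_{ab}(x,y)\,
     \frac{\partial}{\partial \rho_k} \bigl[ -\log P_{ab;\pi(\rho)}(x,y) \bigr] \\
  &= \frac{1}{S \ln 2} \left( 1 - \sum_{a,b} \sigma_{ab}\,
     \frac{Q_{ab}(x_a,y_b)}{P_{ab;\pi(\rho)}(x_a,y_b)} \right).
\end{align*}
```

The last step uses $\sum_{x,y} Q_{ab}(x,y) = 1$ and $\sum_{a,b} \sigma_{ab} = 1$, and terms with $Q_{ab}(x_a,y_b) = 0$ are read as zero. Under these assumptions, a nonnegative derivative at $\rho^*$ amounts to $\sum_{a,b} \sigma_{ab}\, Q_{ab}(x_a,y_b)/P_{ab;\pi^*}(x_a,y_b) \le 1$, with equality whenever $\rho^*_k > 0$.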

For part (b), suppose, by way of contradiction, that for at least one $(x_1,y_1) \in \{T,F\}^2$, $a_0,b_0 \in \{1,2\}$ with $Q_{a_0b_0}(x_1,y_1) > 0$, we have $P_{a_0b_0;\pi^*}(x_1,y_1) \ne P_{a_0b_0;\pi^{\circ}}(x_1,y_1)$. For each $x,y \in \{T,F\}$, $a,b \in \{1,2\}$, we can write

\begin{align}
P_{ab;\pi^*}(x,y) &= \pi^*_{k_1} + \pi^*_{k_2} + \pi^*_{k_3} + \pi^*_{k_4}, \notag\\
P_{ab;\pi^{\circ}}(x,y) &= \pi^{\circ}_{k_1} + \pi^{\circ}_{k_2} + \pi^{\circ}_{k_3} + \pi^{\circ}_{k_4}, \tag{100}
\end{align}

for some $k_1,\ldots,k_4$ depending on $x,y,a,b$. Here each $k_j$ is of the form $x_1x_2y_1y_2$ with $x_i,y_i \in \{T,F\}$. Now consider $\pi^{+} := (1/2)\pi^* + (1/2)\pi^{\circ}$. Clearly $\pi^{+} \in \Pi$. By Jensen's inequality applied to the logarithm and using Equation (100), we have for $a,b \in \{1,2\}$:

$$Q_{ab}(x,y)\bigl[\log Q_{ab}(x,y) - \log P_{ab;\pi^{+}}(x,y)\bigr] \le Q_{ab}(x,y)\bigl[\log Q_{ab}(x,y) - \tfrac{1}{2}\log P_{ab;\pi^*}(x,y) - \tfrac{1}{2}\log P_{ab;\pi^{\circ}}(x,y)\bigr],$$

where the inequality is strict if $x = x_1$, $y = y_1$, $a = a_0$ and $b = b_0$. But then for $a,b \in \{1,2\}$: $U((a,b),\pi^{+}) \le \tfrac{1}{2}U((a,b),\pi^*) + \tfrac{1}{2}U((a,b),\pi^{\circ})$, which for $(a,b) = (a_0,b_0)$ must be strict. By assumption, $\sigma_{a_0b_0} > 0$. But that implies $U(\sigma,\pi^{+}) < U(\sigma,\pi^*) = \inf_{\Pi} U(\sigma,\pi)$ and we have arrived at the desired contradiction. □



C. Proofs of Game-Theoretic Theorems 

1) Game-Theoretic Preliminaries: Proposition 1 gives a few standard game-theoretic results (partially copied from [13]).
We will use these results at several stages in later proofs. 

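Proposition 1: Let $A$ and $B$ be arbitrary sets and let $L : A \times B \to \mathbb{R} \cup \{-\infty,\infty\}$ be an arbitrary function on $A \times B$. We have

1) $\inf_{\beta\in B} \sup_{\alpha\in A} L(\alpha,\beta) \ge \sup_{\alpha\in A} \inf_{\beta\in B} L(\alpha,\beta)$.

2) Suppose the following conditions hold:

a) The game $(A,B,L)$ has a value $V \in \mathbb{R} \cup \{-\infty,\infty\}$, that is $\inf_{\beta\in B} \sup_{\alpha\in A} L(\alpha,\beta) = V = \sup_{\alpha\in A} \inf_{\beta\in B} L(\alpha,\beta)$.

b) There exists $\alpha^*$ that achieves $\sup_{\alpha\in A} \inf_{\beta\in B} L(\alpha,\beta)$.

c) There exists $\beta^*$ that achieves $\inf_{\beta\in B} \sup_{\alpha\in A} L(\alpha,\beta)$.

Then $(\alpha^*,\beta^*)$ is a saddle point and $L(\alpha^*,\beta^*) = V$.

3) Suppose there exists a pair $(\alpha^*,\beta^*)$ such that

a) $\beta^*$ achieves $\inf_{\beta\in B} L(\alpha^*,\beta)$ and

b) $\beta^*$ is an equalizer strategy, that is, there exists a $K \in \mathbb{R} \cup \{-\infty,\infty\}$ with, for all $\alpha \in A$, $L(\alpha,\beta^*) = K$.

Then the game $(A,B,L)$ has value $K$, i.e. $\inf_{\beta\in B} \sup_{\alpha\in A} L(\alpha,\beta) = \sup_{\alpha\in A} \inf_{\beta\in B} L(\alpha,\beta) = K$, and $(\alpha^*,\beta^*)$ is a saddle point.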

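Proof: (1) For all $\alpha' \in A$,

$$\inf_{\beta\in B} \sup_{\alpha\in A} L(\alpha,\beta) \ge \inf_{\beta\in B} L(\alpha',\beta). \tag{101}$$

Therefore, $\inf_{\beta\in B} \sup_{\alpha\in A} L(\alpha,\beta) \ge \sup_{\alpha'\in A} \inf_{\beta\in B} L(\alpha',\beta)$.

(2) Under our assumptions,

\begin{align}
L(\alpha^*,\beta^*) &\le \sup_{\alpha\in A} L(\alpha,\beta^*) \tag{102}\\
&= \inf_{\beta\in B} \sup_{\alpha\in A} L(\alpha,\beta) \tag{103}\\
&= V \tag{104}\\
&= \sup_{\alpha\in A} \inf_{\beta\in B} L(\alpha,\beta) \tag{105}\\
&= \inf_{\beta\in B} L(\alpha^*,\beta) \le L(\alpha^*,\beta^*), \tag{106}
\end{align}

so $L(\alpha^*,\beta^*) = V = \inf_{\beta\in B} L(\alpha^*,\beta)$ and $L(\alpha^*,\beta^*) = V = \sup_{\alpha\in A} L(\alpha,\beta^*)$.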

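(3) To show that the game has a value, by (1) it is sufficient to show that $\inf_{\beta\in B} \sup_{\alpha\in A} L(\alpha,\beta) \le \sup_{\alpha\in A} \inf_{\beta\in B} L(\alpha,\beta)$. But this is indeed the case:

$$\inf_{\beta\in B} \sup_{\alpha\in A} L(\alpha,\beta) \le \sup_{\alpha\in A} L(\alpha,\beta^*) = L(\alpha^*,\beta^*) = K = \inf_{\beta\in B} L(\alpha^*,\beta) \le \sup_{\alpha\in A} \inf_{\beta\in B} L(\alpha,\beta), \tag{107}$$

where the first two equalities follow because $\beta^*$ is an equalizer strategy, and the third because $\beta^*$ achieves $\inf_{\beta\in B} L(\alpha^*,\beta)$. Thus, the game has a value equal to $K$. Since $\sup_{\alpha} L(\alpha,\beta^*) = K$, $\beta^*$ achieves $\inf_{\beta} \sup_{\alpha} L(\alpha,\beta)$. Since $\inf_{\beta} L(\alpha^*,\beta) = K$, $\alpha^*$ achieves $\sup_{\alpha} \inf_{\beta} L(\alpha,\beta)$. Therefore, $(\alpha^*,\beta^*)$ is a saddle point.

□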
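Parts 1 and 3 of Proposition 1 can be illustrated on finite games, where infima and suprema become minima and maxima over matrix entries. The following is a minimal sketch under the assumption that rows play the role of $\alpha$ (the maximizer) and columns the role of $\beta$ (the minimizer). The two payoff matrices and the helper names `minimax`, `maximin` and `is_saddle` are hypothetical and chosen only for illustration.

```python
def minimax(L):
    """inf over columns (beta) of the sup over rows (alpha) of L[alpha][beta]."""
    return min(max(row[j] for row in L) for j in range(len(L[0])))

def maximin(L):
    """sup over rows (alpha) of the inf over columns (beta) of L[alpha][beta]."""
    return max(min(row) for row in L)

def is_saddle(L, i, j):
    """(i, j) is a saddle point if row i is a best reply to column j and vice versa."""
    column_max = max(row[j] for row in L)
    row_min = min(L[i])
    return L[i][j] == column_max and L[i][j] == row_min

# Part 1 can be strict: matching pennies has no value in pure strategies.
pennies = [[1, -1],
           [-1, 1]]

# Part 3: column 1 is an equalizer (every row yields K = 2),
# and column 1 also minimizes row 0, so (0, 1) satisfies conditions (a) and (b).
equalizer_game = [[3, 2, 5],
                  [4, 2, 1],
                  [0, 2, 6]]

print(minimax(pennies), maximin(pennies))
print(minimax(equalizer_game), maximin(equalizer_game), is_saddle(equalizer_game, 0, 1))
```

In the second matrix the equalizer column caps the maximizer's payoff at $K$, while row 0 guarantees at least $K$; that squeeze is exactly the chain of inequalities in Equation (107).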

2) Proof of Theorem 2, the Saddle Point Theorem for Correlated Settings and Generalized Nonlocality Proofs:
Theorem 2: For every (generalized) nonlocality proof, the correlated game $(\Pi,\Sigma,U)$ corresponding to it has a finite value, i.e. there exists $0 \le V < \infty$ with

\begin{align*}
V &= \inf_{\Pi} \sup_{\Sigma} U(\sigma,\pi) \\
  &= \sup_{\Sigma} \inf_{\Pi} U(\sigma,\pi).
\end{align*}

The infimum on the first line is achieved for some $\pi^* \in \Pi$; the supremum on the second line is achieved for some $\sigma^*$ in $\Sigma$, so that $(\pi^*,\sigma^*)$ is a saddle point.

Proof: We use the following well-known minimax theorem due to Ferguson. The form in which we state it is a 
straightforward combination of Ferguson's [13] Theorem 1, page 78 and Theorem 2.1, page 85, specialized to the Euclidean 
topology. 

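Theorem 4 (Ferguson 1967): Let $(A,B,L)$ be a statistical game where $A$ is a finite set, $B$ is a convex compact subset of $\mathbb{R}^k$ for some $k > 0$ and $L$ is such that for all $\alpha \in A$,

1) $L(\alpha,\beta)$ is a convex function of $\beta \in B$.

2) $L(\alpha,\beta)$ is lower semicontinuous in $\beta \in B$.

Let $\mathcal{A}$ be the set of distributions on $A$ and define, for $P \in \mathcal{A}$, $L(P,\beta) = E_P L(\alpha,\beta) = \sum_{\alpha\in A} P_{\alpha} L(\alpha,\beta)$. Then the game $(\mathcal{A},B,L)$ has a value $V$, i.e.

$$\sup_{P\in\mathcal{A}} \inf_{\beta\in B} L(P,\beta) = \inf_{\beta\in B} \sup_{P\in\mathcal{A}} L(P,\beta), \tag{108}$$

and a minimax $\beta^* \in B$ achieving $\inf_{\beta\in B} \sup_{\alpha\in A} L(\alpha,\beta)$ exists.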

By Theorem 1, part (1), $U(\sigma,\pi) = D(Q_{\sigma}\|P_{\sigma;\pi})$ is lower semicontinuous in $\pi$ for all $\sigma \in \Sigma$. Let us now focus on the case of a $2 \times 2 \times 2$ game. We can apply Theorem 4 with $A = \{11,12,21,22\}$, $\mathcal{A} = \Sigma$ and $B = \Pi$. It follows that the game $(\Sigma,\Pi,U)$ has a value $V$, and $\inf_{\Pi} \sup_{\Sigma} U(\sigma,\pi) = V$ is achieved for some $\pi^* \in \Pi$. By Theorem 1, part (2), $0 \le V < \infty$, and, since $U(\sigma)$ is continuous in $\sigma$, there exists some $\sigma^*$ achieving $\sup_{\Sigma} \inf_{\Pi} U(\sigma,\pi)$.

The proof for generalized nonlocality proofs is completely analogous; we omit details. □

3) Proof of Theorem 3, Saddle Points and Equalizer Strategies for 2×2×2 Nonlocality Proofs:

Theorem 3: Fix any proper nonlocality proof based on 2 parties with 2 measurement settings per party and let $(\Sigma,\Pi,U)$ and $(\Sigma^{UC},\Pi,U)$ be the corresponding correlated and uncorrelated games, then:

1) The correlated game has a saddle point with value $V > 0$. Moreover,

\begin{align}
\sup_{\Sigma^{UC}} \inf_{\Pi} U(\sigma,\pi) &\le \sup_{\Sigma} \inf_{\Pi} U(\sigma,\pi) = V, \tag{109}\\
\inf_{\Pi} \sup_{\Sigma^{UC}} U(\sigma,\pi) &= \inf_{\Pi} \sup_{\Sigma} U(\sigma,\pi) = V. \tag{110}
\end{align}

2) Let

\begin{align}
\Pi^* &:= \{\pi : \pi \text{ achieves } \inf_{\Pi} \sup_{\Sigma} U(\sigma,\pi)\}, \tag{111}\\
\Pi^{UC*} &:= \{\pi : \pi \text{ achieves } \inf_{\Pi} \sup_{\Sigma^{UC}} U(\sigma,\pi)\}, \tag{112}
\end{align}

then

a) $\Pi^*$ is non-empty.

b) $\Pi^* = \Pi^{UC*}$.

c) All $\pi^* \in \Pi^*$ are 'equalizer strategies', i.e. for all $\sigma \in \Sigma$, $U(\sigma,\pi^*) = V$.

3) The uncorrelated game has a saddle point if and only if there exists $(\pi^*,\sigma^*)$, with $\sigma^* \in \Sigma^{UC}$, such that

a) $\pi^*$ achieves $\inf_{\Pi} U(\sigma^*,\pi)$.

b) $\pi^*$ is an equalizer strategy.

If such $(\sigma^*,\pi^*)$ exists, it is a saddle point.

Proof: The correlated game has a value $V$ by Theorem 2 and $V > 0$ by Theorem 1. Inequality (109) is immediate. Let $U((a,b),\pi)$ be defined as in the proof of Theorem 1, Equation (91). To prove Equation (110), note that for every $\pi \in \Pi$,

\begin{align}
\sup_{\Sigma^{UC}} U(\sigma,\pi) &= \sup_{\Sigma} U(\sigma,\pi) \tag{113}\\
&= \max_{a,b\in\{1,2\}} U((a,b),\pi). \tag{114}
\end{align}

Thus, Equation (110) and part 2(b) of the theorem follow. Part 2(a) is immediate from Theorem 2. To prove part 2(c), suppose, by way of contradiction, that there exists a $\pi^* \in \Pi^*$ that is not an equalizer strategy. Then the set $\{(a,b) \mid U((a,b),\pi^*) = \max_{a,b\in\{1,2\}} U((a,b),\pi^*)\}$ has fewer than four elements. By Theorem 2 there exists a $\sigma^* \in \Sigma$ such that $(\sigma^*,\pi^*)$ is a saddle point in the correlated game. Since $\sigma^* \in \Sigma$ achieves $\sup_{\Sigma} U(\sigma,\pi^*)$, it follows that for some $a_0,b_0 \in \{1,2\}$, $\sigma^*_{a_0b_0} = 0$. But then $\sigma^*$ lies on the boundary of $\Sigma$. By Theorem 1, part 2(d), this is impossible, and we have arrived at the desired contradiction.

It remains to prove part (3). Part (3), 'if', follows directly from Proposition 1. To prove part (3), 'only if', suppose the uncorrelated game has saddle point $(\sigma^*,\pi^*)$. It is clear that $\pi^*$ achieves $\inf_{\Pi} U(\sigma^*,\pi)$. We have already shown above that $\pi^*$ is an equalizer strategy. □
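The step behind Equations (113) and (114) is that $U(\sigma,\pi)$ is linear in $\sigma$ for fixed $\pi$, so its supremum over the simplex $\Sigma$ is attained at a point mass on a single setting pair, and a point mass is also an uncorrelated (product) distribution. The following is a minimal numerical sketch of that step; the per-setting values in `u_pure` are hypothetical numbers standing in for $U((a,b),\pi)$, and the helper names are introduced only for this illustration.

```python
import random

SETTINGS = [(a, b) for a in (1, 2) for b in (1, 2)]

def u_mixed(sigma, u_pure):
    """U(sigma, pi) = sum_ab sigma_ab * U((a,b), pi), linear in sigma."""
    return sum(sigma[ab] * u_pure[ab] for ab in SETTINGS)

def point_mass(ab):
    """Setting distribution concentrated on one pair; a product of two point masses."""
    return {s: (1.0 if s == ab else 0.0) for s in SETTINGS}

def random_sigma(rng):
    """A random (generally correlated) setting distribution on the four pairs."""
    weights = [rng.random() for _ in SETTINGS]
    total = sum(weights)
    return {s: w / total for s, w in zip(SETTINGS, weights)}

# Hypothetical values of U((a,b), pi) for some fixed local theory pi.
u_pure = {(1, 1): 0.30, (1, 2): 0.70, (2, 1): 0.50, (2, 2): 0.70}
best = max(u_pure.values())

rng = random.Random(0)
samples = [u_mixed(random_sigma(rng), u_pure) for _ in range(1000)]
vertex_values = [u_mixed(point_mass(ab), u_pure) for ab in SETTINGS]

print(max(samples) <= best, max(vertex_values) == best)
```

The same linearity explains the contradiction in part 2(c): a $\sigma$ that attains the maximum cannot put positive weight on a setting pair whose value is below the maximum, so if fewer than four pairs attain it, the maximizing $\sigma^*$ must have a zero component.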



